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the protein, so even the protein is not found in nature. The molecules (polypeptides) referenced 
in these claims are not found naturally but are contained as a component of a single large 
polypeptide (IMP dehydrogenase or IMPDH). Furthermore, reconstitution of the individual 
polypeptide fragments would not produce a system with the catalytic properties of IMPDH. 

The purity of the polypeptides(s) is not a relevant issue because the coordinates of claims 
5-7 were obtained from a crystal of 5. pyogenes IMPDH. Such crystals are not found naturally 
but must be generated using complex screening matrices. 

Claims 5, 6 and 7 are amended to relate "A crystalline molecule . . and there is no such 
product in nature, consequently the present invention is novel and non-obvious. 
IV. Those of Skill in the Art Would Understand What is Meant bv ^^Binding Pockef ^ 

The examiner admits that the specification "discloses" and defines "binding pocket," but 
still complains that the specification does not enable those of skill in the art "to predictably 
identify the binding pocket or the binding pockets from homologous sequences that are 60% or 
more identical." (Action, pages 3-4), 

The invention is enabled in the ability to design inhibitors specific for bacterial forms of 
IMPDH. The target region for the design of such agents is the active site region of the molecule. 
This region is identified in the specification on page 10, line 15 via the statement; "protrusions 
on the proximal face of the a/b barrel scaffold range in size from 3-67 residues and define the 
character of the active site." The inventors specifically identified the amino acid residues 
comprising the active site region. These specific residues are listed in claims 5 and 7, 

The basis for the involvement of these protrusions (loops) in defining the catalytic 
character of the bacterial enzyme is justified throughout the specification. Most notably, this 
information is contained on page 13-15 of the specification. This section describes structural 
features (active site helices H, J, and I) and specific residues involved in or contributing to 
catalysis. Further justification for the involvement of specific residues in catalysis is contained 
on pages 16-17 of the specification. This section describes the site-specific mutagenesis 
experiments in the active site region. 

As is clear from the specification, the term "binding pocket" is equivalent to "catalytic 
pocket," "catalytic site," and "binding site." The Wilson patent cited by the examiner indicates 
this term is known to those of skill in the art. The description of the catalytic site and references 
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contained in the sepcificaiton describe the interaction of carious ligands {e.g, IMP and NAD) 
with MPDH enzymes. For example all IMPDH enzymes have a cysteine at the active site and a 
consensus sequence for IMP binding (page 2 of the specification and reference cited herein). In 
view of this information it is reasonable to assume that the loaction of the binding pocket is 
generally conserved in all IMPDH enzymes thus fac characterized. Furthermore, IMPDH crystal 
structures indicate these enzymes all contain the same basic protein fold. The conservation of 
amino acid in sequence for bacterial IMPDH enzymes (Specification, Table 2), provides a 
method for the identification of amino acids corresponding to those residues found in the binding 
pocket of S. pyogenes IMPDH. 
V. Claim 6 is Not Too Broad 

The examiner wants to limit the claims to the crystal specifically disclosed in the 
application. The examiner questions whether less than 100% sequence identity predictably gives 
a sequence with the same biochemical activity. 

The examiner correctly asserts that in many cases it is impossible to predict the specific 
functional consequences of amino acid replacements. However, this is not globally true {e.g. it 
can be reliably predicted that replacement of the active site cysteine will have a substantial effect 
on function. The inventors describe consistent functional and sequence differences between 
bacterial and mammalian enzymes (specification page 2, lines 5-20; page 3, lines 2-11; page 5- 
8;and references, tables, and figures contained therein). The biochemical and kinetic differences 
between bacterial and mammalian enzymes are a consequence of the variance of specific, amino 
acid residues. Identification of these residues or combination of residues that impart this 
mammalian or bacterial enzyme signature is a prerequisite for the rational identification of agents 
that specifically target the bacterial enzyme. The invention enables the targeting of specific 
amino acids residues via site-directed mutagenesis or other means to elucidate the basis for the 
differences between bacterial and mammalian IMPDH enzymes. The percent homology 
definitions utilized in claim 6 represent the sequence diversity differential between bacterial and 
mammalian IMPDH enzymes. 

VL Format of Claims 1-8 is Amended Without Altering the Scope of the Claims 

Amendments are made to the format of claims 1-8. These amendments do not alter the 
scope of the claims. Amendments in a separately filed document are additions of SEQ IDs from 



I i 

Serial No: 09/533,466 -5- Attorney Docket No. 21416/90042 

the concurrently filed computer sequence listing. Amendments herein spell out some of the 
terms. For example, the abbreviation "IMPDH*' which is spelled out in the specification as 
inosine monophosphate dehydrogenase, and which only has one meaning, is spelled out in 
amended claim 1. The abbreviation term ''MP" is spelled out in the specification and only has 
one meaning (inosine monophosphate). Claim 8 is amended to also spell this term out. Because 
claims are interpreted in light of the specification, all abbreviated terms were definite to begin 
with. Amendments have not altered the scope of the claims. 

Claim 4 adds on a step, which just repeats the preamble, so it does not further limit the 

claims. 

A variety of molecular docking programs are available (two are provided herein as 
Exhibits A and B) to enable those skilled in the art to screen compounds to identify potential 
inhibitors. These programs are a common component of most structural biology software (e.g. 
Insight n from Molecular Simulations Inc.). Most of the newer programs provide for automated 
docking of ligands to receptors in the structure-based drug design process. A variety of purine 
analogues and mammalian inhibitors have been reported in the literature and would be likely 
candidates to begin a search for specific inhibitors of bacterial IMPDH enzymes. However, 
universities (e,g. University of Georgia) and commercial entities (Vertex Pharmaceuticals) have, 
and are continuing to construct, small molecule digital libraries. The present invention provides 
the structural coordinates of bacterial IMPDH that enables the screening of these candidate 
compounds/digital libraries. 

Claims 5, 6, and 8 relate "binding pocket" which is clear. 

"a computer generated representation of a molecule or molecular complex 
wherein said molecule or molecular complex comprises a binding pocket defined by the 
structural coordinates of 5. pyogenes IMPDH amino acids 

SEQ ID numbers are filed concurrently so the amino acid numbering of claim 7 is clear. 
The amino acid numbers of claim 7 refer to Table 7. 

Claim 7 is amended to insert the definition of "coordinates" from the specification. 
Those of skill in the art would understand the term without the amendment, and the amendment 
does not alter the scope of the claim because it merely recites the definition already in the 
specification. 
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Claim 9 is added. 

VIL Zhang et al^ 1999, Is Prior Work of the Present Invention Published Less Than a 
Year Before Present Filing Date Therefore is Not Anticipated 

Claims 1-8 were rejected under 35 U.S.C. § 102 (a) over Zhang et al 
Claims 1-8 are not anticipated by Zhang et al. A Declaration under 37 C.F.R. 1.131 from 
Dr. Collart is appended herein (Exhibit C) showing that the present invention was prior work of 
the present inventor published less than 1 year before the filing date of the present application. 
Moreover, as the examiner admits, not all elements of claims 1-8 of the present invention were 
taught by Zhang et al Therefore, for both reasons Zhang et al does not anticipate claims 1-8 of 
the present invention. 

Disclosure in a publication does not prove that any "invention" within the meaning of 35 
U.S.C. 102(g) has been made. An inventor's own prior work will not anticipate his later 
invention, absent a statutory bar. In re Katz 215 USPQ 14 (CCPA 1982), "a printed publication 
cannot stand as a reference under 102(a) unless it is describing the work of another." {Katz 215 
USPQ 14, 17). Authorship of a publication does not raise the presumption of inventorship with 
respect to subject matter disclosed in publication {Katz 215 USPQ at 18). Therefore, co-authors 
may not be presumed to be co-inventors merely from co-authorship, 
VIII. Sintchak Does Not Anticipate Claims 1-6 and 8 

Claims 1-6 and 8 were rejected under 35 U.S.C. § 102(b). The examiner admits that 

however, it should be noted that patentable weight 
is given to the IMPDH composition, since the 
distinguishing features of the composition are in the 
specific sequence , structure, and function of 
IMPDH and not the cloning vehicle from which it 
came... It should be noted that the diffraction 
patterns obtained by Sintchak et al. would be useful 
for defining molecular structures of similar bacterial 
EMPDHs through the use of such methods as 
molecular structures of similar bacterial IMPDHs 
through the use of such methods as molecular 
replacement. The molecule and molecular complex 
solved by Sintchak et al reads on a molecule 
comprising a homologue of the 5. pyogenes IMPDH 
of the instant invention. 

(Action, page 7) 
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Sintchak does not anticipate claims 1-6 and 8 because the bacterial enzymes and 
mammalian enzymes are different as described in the specification and Section IV herein. The 
structure of the Chinese hamster enzyme does not elucidate the basis for these differences. Only 
a crystal structure of a bacterial EMPDH enzyme can be used to delineate the basis for these 
differences. 

The S. pyogenes and Chinese hamster IMPDH enzymes show only a 35% identity. This 
level of homology did not enable interpretation of the 5. pyogenes model. A similar case was 
observed for the structure of IMPDH from Tritrichomonas foetus which shows a similar level of 
identity to the S. pyogenes and Chinese hamster enzymes. 

It should be noted that conditions for the generation of crystal in the present invention are 
different than those described for the Chinese hamster enzyme. 

IX. Collart et al. Does Not Anticipate Claims 5-6 

Claims 5-6 were rejected under 35 U.S.C. § 102(b) over Collart. 

Collart et al does not anticipate claims 5 and 6 because Collart (1996) does not teach 
elements that are the same as claims 5 and 6. Collart (1996) describes amino acid similarities 
between IMPDH enzymes but does not teach methods to obtain well diffracting crystals of 
IMPDH enzymes, as in the present invention. Although IMPDH from S. pyogenes and P. 
furiosus share 50% sequence identity, crystals of P.furiosus IMPDH suitable for structure 
determination have not been obtained in any laboratory. There are methods based on amino acid 
sequence alignments to predict proteins that will crystallize. Every crystallization laboratory 
utilized a matrix (48-1536 individual conditions) of solutions to screen for conditions that 
produce well diffracting crystals. Despite the availability of these screens, many proteins still do 
not produce well diffracting crystals suitable for structure determination. 

X. Wilson Does Not Anticipate Claims 1-6 and 8 

Claims 1-6 and 8 were rejected under 35 U.S.C. § 102(e) over Wilson et al. 

It should be noted that diffraction patterns obtained 
by Wilson et al would be useful for defining 
molecular structures of bacterial IMPDH through 
the use of such methods as molecular replacement. 
The molecule and molecular complex solved by 
Wilson et al. reads on a molecule comprising a 
homologue of the S. pyogenes IMPDH of the instant 
invention. 
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(Action, page 8) 



Wilson relates IMPDH Type H. Bacteria do not have this IMPDH, they only have Type L 
There is only 35% identity between the proteins which made it impossible to use the crystal 
model to solve the bacterial IMPDH molecule structures. 
XI, Computer Sequence Listing 

A computer sequence listing in accordance with 37 C.F.R. § 1.821-1.825 is filed 
concurrently. 
XIL Other Issues 

The examiner did not consider the Holbrook publication in the Information Disclosure 
Statement because the date was not on the forms submitted. 

The missing date for Holbrook in the Information Disclosure Statement is 1975. A copy 
of the publications was submitted, and the date was cited in the specification, therefore Holbrook 
should have been considered because the date of the publications was easily accessible in the 
record. 

XIIL Summary and Conclusion 

For the reasons stated above, applicant requests allowance of all pending claims. 

Please contact applicants' representative if you have any questions. 

No other fees are believed due at this time, however, please charge any deficiencies or 
credit any overpayments to deposit account number 10-0435 with reference to our attorney 
docket number (21416-90042). 



BARNES & THORNBURG 

2600 Chase Plaza 

10 South LaSalle Street 

Chicago, IL 60603 

(312)214-8316 

March 14, 2001 
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ABSTRACT: The DOCK program explores possible orientations of a molecule 
within a maciomoleculax active site by supeilmposinig atoms onto precomputed 
site points. Here we compare a number of different search methods, including 
an ^diaustive nnatching algorithm based on. a single docking graph. We evaluate 
the performance of eadi method by screening a small database of molecules to a 
variety of macromolecular targets. By varying the amount of samplings we can 
monitor the time convergence of scores arid rankings. We not only show that the 
site point-directed sear(±L is tenfold faster than a random search, but that the 
single gtfjph matching algorithm boosts the speed of database screening up to 
60-fold. The new algorLthm, in f acty outperfocms the bipartite graph iiiatchiiig 
algarithm currently iised in DOCK. The results indicate that a critical issue for 
rapid database screening is the extent to which a search method "biases run time 
toward the highest-ranking molecules. The single docking graph matching 
algorithm will be incorporated into DOCK version 4.0, © 1997 by John Wiley &c 
Sons, Inc. / Comput Chem 18: 1175-1189, 1997 

Keywords: automated molecular docking; 3D database search; molecular 
recognition; structure-based drug design 



Introduction 

With the advent of high resolution X-ray 
crystallography and NMR, structural 
chemists and biologists can study biomacromolec- 
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ular interactions in atomic detail. This information, 
combined with computational and visualization 
tools, has helped spawn the field of structure-based 
ligand design. A common step in the design cycle 
is the process of molecular docking, in which pos- 
sible binding geometries of a molecule with a 
macromolecule are studied. 

The docking process can be divided into two 
parts: a search algorithm and a scoring algorithm. 
The search algorithm shotald sample the degrees of 
freedom of the ligand;macromolecule system stiffi- 
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ciervtly to include dve true binding modfiCs). The 
scoring algorithm should represent the theimody- 
namics of interaction suffidently to distinguish the 
true bidding mode(s) from all others e^lored. 

Because of the computationally expensive na- 
ture of the search problem, many different solu- 
tions have been proposed. Docking with molecular 
dynamics and. Monte Carlo algorithms has been 
eacplored/'^ including simulated annealing^"^ and 
MCSS* methods. Other docking protocols consider 
molecular flexibility, including rotamer searchy^"^ 
distance geometry/'' and genetic algoiithm^^"^^ 
methods. To make the search tractable fot process- 
ing a large set of molecules, the moleciilar compo- 
nents are often treated as rigid objects. With this 
approximatioiv researchers have used systematic 
searching,^^ pattem recognitian,^"^^ &^T?^ theo- 
retical/®"^^ and other superposition.^ techniques 
to dock molecules. 

The UCSF DOCK program belongs to the group 
of methods employing the rigid body assximption 
and uses graph theoretical techniques. Because of 
its speed, the program is often used to screen a 
lajfge database of nriolecuLes, selecting potential 
Hgands of a receptor target^ In this artidfi, the 
term "ligand" is used loosely; it refers to any 
small molecule wiiose binding is under study. The 
term "receptor'' refers to the macromolecule whose 
binding pocket is being explored. 

The core of the DOCBC search algorithm is the 
supetimpositioii of ligand atoms onto predefined 
site points^ ^ that map out flie negative image of 
the bindixig site- A matching process is used to 
determine which ligand atoms and site points are 
to be superimposed.^ Multiple orientations are 
generated this way, with each receiving a score 
assessing the intermolecular interactions. This score 
is based on the intermolecular terms of a molecu- 
lar mechanics force field Kecently, an optimiza- 
tion procedure has been added that adjusts each 
orientation to improve the intermolecular interac- 
tions.^'^^ 

In this work, we critically evaluate several 
matching algorithms for the docking process, in- 
cluding an exhaustive matching algorithm. The 
exhaustive algorithm was presented by Bron and 
Kerbosch as a method to delect diques in an 
undirected graph^^ It was later incorporated into a 
docking algorithm by Crippen and coworkers."'^' 
It has many atteactive features, so we chose to 
evaluate it in the context of rigid molecular dock- 
ing, score optimization, and database screening. 
As an exhaustive search, it avoids some of the 
artifacts encoimtered by the current matching 



method. For example, with a nonexhaustive algo- 
rithm, adjusting parameters to increase the total 
amount of sampling can reduce the amoimt of 
sampling of certain binding modes.^ Increased 
sampling, with the new algorithm, will always 
retain binding orientations found with less sam- 
pling, leading to a proper superset of binding 
modes. An exhaustive algorithm also does not 
require the additional parameters controlling the 
heuristics of the nonexhaustive search.*^ 

Althougii the matching algorithms formally treat 
the ligand and receptor as rigid objects, they can 
readily be incorporated into a flexible docking 

scheme.^°'^^'^^"^^ In future work, we will investi- 
^te how best to divide up a flexible docking 
problem into smaller rigid parts, 

To evaluate the performance of the new match- 
ing algorithm, we propose a new assessment pro- 
tocol based on screening a small database of 
molecules. Because we specifically intend to mini- 
mi2:e any artifacts due to the quality of the scoring 
function in this work, we will not use experimen- 
tal measurements as the standard, but instead^ the 
global minimum of our current scoring function^ 
We will also evaluate random, and partially ran- 
dom, search algorithms as controls with which to 
put the current TXXIK performance in perspective. 
These control algonthrns let us investigate fimda- 
mental issues of orientational sampling, such as 
the effect of using site points to guide the search- 



Methods 

BIPARTITE DOCKING GRAPH 

Since the first release of DCOC the search pro- 
cess has been driven by a matching procedure in 
which subsets of ligand atoms and receptor site 
points are identified that have equivalent internal 
distances.^^ Matching is formulated as a graph 
theoretical problem in which the ligand atoms and 
receptor site points are separate sete of nodes in a 
bipartite graph.^ A match is defined as a set of 
compatible edges which connect a subset of ligand 
nodes with an equal number of receptor nodes, Por 
the edges to be compatible, the distances among 
ligand nodes must map to equivalent distances 
among receptor nodes. An example of match for- 
mation is depicted in Figure 1. As this figure 
illustrates, to extend a match, all possible edges 
(including bad edges) must be considered; dis- 
tance comparisons are used to identify and discard 
bad edges. Matches are extended tmtil there is a 
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sufficLerit number of nodes in the match to define a 
tmique onentation of the ligand. 

Since version 2.0, DOCK avoided considering 
some tad edges with a pruning mathod involving 
distance "binning-^ Nodes were preorganized in 
distance bins^ such that, for each seed node, sebs of 
nodes ia discrete distance intervals from the seed 
were identified- These bins guide match extension 
from a seed edge (connecting seed nodes), ensur- 
ing that candidate edges are compadble with the 
seed edge. They do not, however, ensure compati- 
bility with other noEOseed edges already included 
in the match. For example, the binning algorithm 
would avoid considering the bad edge in step II of 
Figure 1, but not the bad edge in step DI. The 
storage requirements for this algorithm grow as 
N^NjjN^yj, :^ N^, where 2^„ is the number of nodes 
Oigand atoms or receptor site points). Is the 
number of distance bins and grows with the longest 
distance and the inverse of the bin widths ^n/b ^ 
the number of nodes in each distance bin wldch 
grows with N„ width- 

SINGLE DOCKING GRAPH 

Kuhl et al, proposed merging tlie bipartite dock- 
ing graphs into a single docking graph,^^ which is 
then amenable to dique detection techniques de- 
veloped by Bron and Kerbosch.^ In a single dock- 
ing graph, each node represents a pairing of an 
atom with a site point. Each edge identifies adja- 
cent xuxies, or two nodes for which both atom 
components and site point components are sepa- 
rated by equivalent distances. The docking graph 
is represented by an adjacency matrix in which 
each nnnzero element identifies adjacent nodes. 
An adjacency matrix for the example depicted in 
Figure 1 is presented in Figure 2- The chief advan- 
tage with this representation is that all necessary 
distance comparLBons are made during the con- 
struction of the adjacency matrix Consequently, 
during matching, the adjacency matrix is used as a 
rapid filter to ensure that no bad edges are ever 
considered- This type of matching is presented in 
Figure 3. Although the single docking graph is one 
step removed from the intuitive appeal of the 
bipartite docking graph, it enables a more efficient 
solution to the docking probleiru 

The sin^e docking graph representation has 
also been implemented in Ihe FLOG docking pro- 
gram.^^ This program heavily prunes the matching 
search tree using a minimum-residual search 
heuristic. Although it examines all possible nodes 
at each branch point, it only pursues , the node with 
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FIGURE 1. Bipartite graph matching aigorrthm. The 
receptor srte points (A-£) and ligand atoms (l -4) are 
separate sets of nodes in a bipartite graph.^^ L A firert 
(seed) edge is considered. Of the 20 seed edges to be 
tried (5 points* 4 atoms), we first consider A4 for this 
eJcample, In three dimerrslons, such a match would 
superimpose atom 4 onto site point A. This match would 
fix three of sir orientational degrees of freedom. IL 
Second edges are considered. Of the 12 edges to be 
tried (4 points left*3 atoms iefti, we consider E3- 
Because AE > 43, we discard E3 as a second edge. 
Then we consider E1 . Because AE = 41 , we retain E1 as 
a second edge. In three dimensions, this match would 
superimpose atoms 4 and 1 onto points A and E, 
raspectivety. This match would fix two more orientational 
degrees of freedom. III. Third edges are considered. Of 
the six edges to be tried (3 pofnts left*2 atoms lefO. we 
consider C3. Though AC = 43, EC < 13 so we must 
discard C3 as a third edge. Then we consider B3. 
Because AB - 43 and EB - 13, we retain B3 as a ttiird 
edge. This match fixes the last of six orientational degrees 
of freedom- IV. The match is large enough to define a 
unique orientation which superimposes atoms 4. 1 , and 
3 onto site points A, E, and B, respectively. 



the smallest difference in ligand and receptor dis- 
tances with respect to the most recently added 
node in the match. Backtracking is only allowed at 
the seed levels where all possible nodes are pur- 
sued to initiate matching. As a result, the total 
number of matches can never exceed the number 
of nodes, 

Here we propose implementing the single dock- 
ing graph representation combined with a varta- 
tiOT. of the exhaustive clique detection method 
discussed hy Bron and Kerbosch,^ A clique is 
defined^ as a set of fully adjacent nodes G.e., a 
completely connected subgraph) which cannot be 
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FIGURE 2, Adjacency matrix for single docking graph 
algorithm. This matrix identifies all adjacent nodes for the 
example given in Figure 1- Each node la defined as a 
site point-ligand atom pair, for instance. A4. For two 
nodes to be adjacent, the Intra-atom distance must be 
equal to the intra-srte point distance- For example, matrix 
element (A4. E1) is turned on because AE=41- The 
matrix is symmetric. 



furiher enlarged without adding a nanadjacent 
node. Much attention is given to the intractable 
nature of the maximuin clique problem. It is classi- 
fied as NF-compIete because the solution time 
grows faster than any polynomial expression of 
the pnDiblem size.^^ For the application of molecu- 
lar docking, however, we are not trying to find the 
single, largest dique- The process of matching is in 
fact a process of finding completely cannected 
subgraphs within an undirected graph: a less re- 
strictive^ and therefore more tractable problem than 
finding dlques and maximum cliques. Althougfi 
Bron and Kerbosch actually present two methods 
and recommend a bounding technique for clique 
detection, we find their original, brute-force 
method sufficient for finding fully connected sub- 
graphs in a manner efficient for molecular dock- 
ing. 

MATCHI!VG PARAMETERS 

Jxx out molecular docking implementatidv we 
use two parameters to determine node adjacency: 
a distance tolerance and a distance minimum. The 
distance tolerance parameter addresses experimen- 
tal uncertainty in the ligand and target structures. 
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FIGURE 3. Single docking graph matching algorithm. 
This figure depicts the entire search path-starting from - 
the same seed used in Figure 1- Matching begins with 
the match set, Mq, empty and the adjacency set, /^q. 
maximally filled* A match is extended by finding the 
union of the previous match» Af^.-,, and a branch node. 

Which Is selected from v Tt^e new adjacency 
set, An» is the intersection of , wfih the set of ail 
nodes adjacent to b^, which is A^^ and is taken from the 
adjacency matrix presented in Figure 2. Match extension 
continues until A„ is empty. Matches with three or more 
nodes define a unique ligand orientation (see text and 
Rg. 4). The match set, {M B3 corresponds to the 
solution presented in Rgure 1. 



The distance minimum parameter enables the 
search to focus on the longer^ more relevant inter- 
nal distances. If adjacency iaformatiQn is stored in 
a matrix, then, for most docking situatLons, the 
matrix can be very large. The matrix size grows as 
(N|,^N„^)^ where Nj^^ is the number of nonhy- 
drogen ligand atoms and N^cc ^ the number of 
receptor site points. Because these matrices are 
sparse (ca. 1% elements typically occupied), we 
store only the nonzero elements of each row of the 
matrix as an integer list of nodes. Hie probability/ 
p^„/ of an element being nonzero is a function of 
the distance tolerance and distance minimum. The 
memory requirement for the adjacency lists is 

Pon^^iig^rcc)^' We presort each adjacency list so 
that the process of finding the common elements 
of two lists (the — A^.^^ n A^^ steps in Fig. 3) 



1178 



VOL 1B, NO. 9 



AUTOMATED MOLECULAR DOCKING AND DATABASE SCREENING 



can be performed an a once^through basis at a 
speed comparable to the use of the complete ma- 
trix. 

The advantage of an exhaustive algorithm is 
that, when sampling is increased, the search is 
guaranteed to include the search space explored at 
the lower sampling leveL This property helps to 
avoid sampling aiti farts encountered with the bi- 
partite matching algoritihrrL'^ Despite its exhaus- 
tive nature, it does nort undergo a combinatorial 
explosion for larger systems because of user conr 
trol over the sampling parameters. Typical sam- 
pling parameters for a docking scenario having 
fewer than 30 ligand lUDnhydrogen atoms and fewer 
than 50 target site points are: four nodes minimum 
for a matdv and 0.5-A distance tolerance and 2.0-A 
distance minimum for node adjacency. Because the 
search never explores invalid branches, search time 
grows as a function of the number of distance-con- 
strained solutions rather than the number of possi- 
ble unconstrained solutions. Therefore, docking 
larger molecules into larger sites can be made 
nearly as rapid if a smaller distance tolerance or 
larger distance minimum is chosen Because mem- 
ory reserved for the adjacency lists is dynamically 
allocated, the memory burden is adjusted as well. 

IVIINI!VIIM MATCH SIZE 

Some confusion exists in the literature over how 
many atoms and site points must be in a match to 
define a imique orientation. The orientation is gen- 
erated by a ligand transformation which has a 
translation and a rotation componenL The transla- 
tion vector has three degrees of freedom- Tbie rota- 
tion matrix has four degrees of freedom. Three of 
them are represented by the Euler angles. The 
fourth is represented by the sign of the determi- 
nanL A rotation with a positive determinant re- 
tains the handedness of the object it transforms, 
while one with a negative determinant will reverse 
the handedness of the object. If one knows in 
advance whettier to reverse the handedness of the 
object, then only the three Euler angles need to be 
determined, For example^ when docking a chiral 
ligand in which only one stereoisomer is relevant 
(e.g., protein or peptide ligandsX only the positive- 
determinant rotation matrix would be of interesL 
When docking a ligand available as a racemate, 
then both transformations would be of interest 
The FLOG program,^ for instance, routinely sam- 
ples both mirror images a ligand, even when the 
ligand is achiral. When the sign of the determinant 
is known in advance, the six degrees of freedom of 
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FIGURE 4, Re(atK/8 chlrality of match points. (A) 
Distance matching might identify a set of atoms {12 3 
4} to match with a set of site points {A B C D}, such 
that A is with 1, B is with Z, and so on- AJthough the 
internal distances within the atoms are equivalent to 
those within the site points, the handedness is opposite, 
(B) We define the relative chfrallty, O, according to the 
sign of the triple product For any given four-point match, 
the probability that the relative chirafrties are the same is 
50%. (C) tf the chirally opposed sets are superimposed 
without inverting the chirality of the ligand sat, then the 
resulting least squares fit is poor. 



the rotation and translation are uniquely deter- 
mined by a match set containing three itonUnear 
atoms and site points. 

Processing a larger match set causes only one 
transformation to be allowed. As illustrated in 
Figure 4, when a one-to-one mapping has been 
made between two sets of four nonplaxiar points, 
then each set can be assigned a relative chirality. 
This is true even if the points come from an achiral 
molecule or from a set of site points where chiral- 
ity is ambiguous. If the relative cfairalities are the 
same, then the ligand can be oriented normally. If 
the relative chiralities are opposite^ then either the 
ligand is inverted when oriented, or if that is not 
desired^ then the match is discarded. In fact, all 
larger matches that are supersets of the discarded 
match are also discarded, because they too yield 
inconsistent matches. If these steps are not taken, 
then the resulting orientations will poorly super- 
impHDse the ligand atoms and receptor site points 
in the match set, even thougih all the distance 
tolerances are met (Fig. 4C). 

ADDmoIVAL SEARCH CONSTRAINTS 

The systematic design of the matching algo- 
rithm makes it well suited to incorporate special^ 
ized search constraints. Some e?camples, al1i\ough 
not assessed in this study, are mentioned because 
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^heyr have been shown to he useful elsewhere. To 
avoid oversampling particular binding mcxles, oxi- 
entalional degeneracy checking has been shid- 
ied.^'''^ In ihe new matching algorithm/ a degener- 
ate orientation is detected as a degenerate match 
whose nodes axe a subset of nodes in a larger 
match- In other wordS/ only subgraphs that are 
true cliques need be processed- As another exam- 
ple/ chemical information can be used to guide the 
matching process using labeled atoms and site 
points. Qcily nodes composed of a chemically com- 
patible atom and site point are used to seed or 
extend a match. Much like the repellent node im- 
plementation of Kuhl et al./^ matches adjacent to 
chemically incompatible nodes are discarded. In 
additiOT, sampling can be focused on particular 
regions of the active site by defining critical site 
point clusters. This technique is similar to the 
approach used in targeted DOCK^ and FLOG/'^ 
except that clusters can be of arbitrary size and 
number. The matching process automatically re- 
stricts itself to make sure all matches include 
members from each duster. The new matching 

TABLE L 

Search Methods. 



algorithm lends itself so well to these constraints 
that when activated, they contiibutE a negligible 
computational overheads and can lead to consider- 
able speed improvements for database search- 



CONTROL METHODS 

We -will test a total of five methods (Table I) to 
isolate specific aspects of the search process- These 
methods range in complexity from a completely 
random search to the bipartite and single graph 
procedures described previously. We begia with 
the uniform random ttansf ormation CURT) method, 
which explores a predefined rectangular volume 
enclosing ttie active site. It is the most simple and 
"hypothesis-free" of the methods tested here. TJRT 
will indicate the minimum level of performance 
that we expect from any docking algorithm. The 
uniform random matching (Ul3*/D method ex- 
plores the irregularly shaped volume described by 
the collection of site points. It will show the perfois 
mance gains> if any, of using a "negative image" 



Abbrev. Method Description Hypothesis Tested 



URT 



URM 



BRM 



SGM 



BQM 



Uniform 
random 
tfBnsformation 



Unttomni 
random 
matching 



Biased 

random 

matching 

Single 
graph 
matching 



Bipartite 

graph 

matching 



CorLstoict rectangularvolume enclosing 
srta 

Randomly move molecule center of 
mass within volume 
Randomly rotate 

Each molecule In datattase sampled 
unrformly 

Match random subsets of atoms with 
random sutissts of site points 
Superimpose match atoms onto match 
site points with a least squares fit 
Each molecule in database sampled 
uniformly 

Match randomly Olte URM) 

SGM controls amount of sampling for 

each molecule in database 

Using single graph, exhaustively 
match subsets of stoms^with subsets 
site points with equJvalent internal 
distances (DOCK 4.0) 

Using bipartite graph, nonexhausitlvely 
match using binning algorithm 
(DOCK 3.5) 



• Random method used as reference 
« Rectangular enclosure is strfficlent 
Site point description Is unnecessary 



An inegular volume to describe the 
site Is more efficient 



An irregular volume is more efficient 
Spending more time on molecules 
which match- better Is more efftcienrt 

Frtring some atoms precisely orrto 
of some site points is more efficient 



Fitting some atoms precisely onto 
some site points is more efficient 
Binning algorithm more efficient 
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approach to map out the binding site. The "biased 
raildoiii matching CBRM) method is identical to 
UEMy except it uses the new matching algorithm 
to determine the number of random matches to try 
for each molecule. Once the number of matches 
has been determined^ BRM uses completely ran- 
dom selections of nodes to form the actual matches 
used to generate molecule orientations, Becaiase 
HRM is a hybrid approach^ it is meant to help 
isolate the source of any differences between UEM 
and the new matching algorithm. The single graph, 
matching (SGM) algorithm uses the new matching 
algorithm to determine both the number of matdies 
and the actual orientations to try for each molecule. 
SGM will reveal the performance gains, if any, of 
using site points to not only map out the most 
interesting binding site volume, but to also direct 
the positioning of individual ligand atoms. The 
bipartite graph matching (BGM) meihod is the 
existing DOCK 3.5 matching algorithm- It wiH re- 
veal the advantage^ if any, of using a nonexhaus- 
tive search method with a longest distarure first 
heuristic. BGM is described last because, within 
the spectrum of different search methods^ its algo- 
rithm is the most elaborate. 

TEST SYSTEM 

We assess the performance of the search meth- 
ods in the following way. We dock a set of 100 
molecules^ chosen randomly from the set of un- 
charged, medium-sized, and generally rigid 
molecules in the available chemicals database 
G^.CD).^ la our study, a medium-sized molecule 
is one with 15 to 35 nonhydrogen atoms. A gener- 
ally rigid molecule is one with no single bonds 
except those attaching hydrogen atoms, attaching 
terminal nonhydrogens (i.e., methyl or hydroxyl 
groups); or participating in ring structures. Mole- 
cules meeting these three criteria compose 40% of 
the ACD. For each molecule in the test set, a single 
CONCOED^-generated conformatian is used. 

We see several advantages to using such a data 
set of molecules to test the search methods. First, 
the docking conditions represent a close approxi- 
mation to the typical application of DOCK to 
database screening. Not only can we study the 
convergence of score for each molecule, we can 
study the convergence of relative scores, or rank- 
ings, of the set of molecules. Secondy the docking 
conditions allow us to explore a multitude of di- 
verse molecular shapes so that our results are less 
subject to potential artifacts of a particular ligand: 
receptor system- Although some may argue that 



studying a set of knowrv- potent Hgands would be 
more relevant, we counter that the databases 
DOCK searches often do not contain potent 
binders, and that DOCK frequently finds micromo- 
lar inhibitors to serve as lead compounds.^ By 
choosing a random subset of moleculeS/ we, in 
fact, will arrive at a set that best represents the 
typical array of molecules tested. By biasing the 
subset to include medium-sized, generally rigid 
molecules, we also focus on that portion of the 
database which is best treated by rigid molecular 
doddng, 

For each search method, we perform multiple 
docking runs, and vary the amount of sampling 
from zero to a value at which the docking results 
converge. The key sampling parameters for each 
method are listed in Table II, Scoring and opti- 



IL 



Sampling Parameters, 



URT 

TotaJ orientations 

URM 
Total orientations 
Nodes min/rnax 

BRM 

Distance tolerance 

Distance minimum 
Nodes min/max 
SGM 

Distance tolerance 

Distance minimum 
Nodes min/max 
BGM 

Distance tolerance 

Ligand bin width 

Receptor bin width 

Ligand bin overlap 

Receptor bin overlap 
Nodes min/max 



0-900,000" 

0-100,000*, 
3/3'' 

0-0.9 A" 

2.0 
3/3^ 

0-0.6 A* 
2.0 A'^ 

1.5 A^ 
0.1-0.9 A«'' 
0.1-0.9 A*'^ 
0.1-0,5 A^'^ 

0-1-0.5 A^-' 
4/4-« 



" For soma test systems, The upper limit was not reajdied if 
docking resutts convergad earfy. 

''Set to Three, because, as the size of a random match 
increases, the least-squares superpositjon procsdure in- 
creasingly tjiases Tt»e orientation toward the centrcsd of the 
site fXJints. 

"^Set large enough to exdude atoms sharing a covaierrt 
torrd. 

Minimum of four chosen so That chirailty could ba used to 
fitter matc*>es. Maximum of ten is sonrwwhat inconsl^ent 
with value chosen tor BGM, but we presume any effects of 
this would be small. 
* Chosen as historicaJ default 

^ Minimum value rwt zero because of a numericaJ InstebilrTy 
of The algorithm. 
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mizalion paiameters axe listed in Table IIT. The 
range of values in these tables correspond to tim- 
ings of less than 0.1 second/molecule to more than 
100 seconds/molecule on a modem workstation 
Csee subsequent text). 

For eadi docking run, several properties are 
computed that compare the resrdts from any spe- 
dBc run to the best results hxxcxx all runs conibined 
(assumjed to contain the global mixumum). These 
properties are summarized in Table IV. When these 
values are plotted versus time, ihe convergence of 
each property can be momtored. We assume that a 
better search, algorithm wjll lead to more rapid 
convergence. 

Whan considexing the average behavior of each 
property, we compute both the laatial mean and 
also a rank-weighted mean. The rank-weighted 
mean Is more sensitive to the behavior of the top 
scoring molecules, which are of most interest in 
database screening runs. Although many kinds of 
weigihting functions could be chosen for this pur- 



TABUE ill. 

Scortng and Optimiz ation Ponamet^s. 

Type 



Force field 



Bump maximum 
Dlelectnc 

Grid spadng 
Interpolation 

Convergence criteria 
Maximum derations 



3» 

Tri linear 

0.1 kcal/mol^ 
500*= 



"Monzero marimum allows some orientahons with limited 
Van der Waals clashes with Ihe receptor to bo recovered by 

^'A relaavely tight COTVorgerwe crtterla was saJected to re- 
cJucQ nolsa In the score evaluation, so mat dIffarerKes Ije- 
rw^n methods were more directly attributable to differences 
In sampling^ The rank conflation would be especially vulner- 
able to such noise. ' 
•^A targe ItenatJon limit was also selected to reduce noise In 
the score evaJuation by preventing the minimizar from t^ml- 
nating prematuraly. 



TABLE IV. 

Comparison Methods. 



Method 



Definition 



Equation 



Rank weighted 
equation 



Range 



Average 

relative 

score 



Rank 

Correlation 



Average 
RMS . 
error 



For each molecule in the 
docking run, nomnalize 
its score by the best 
score \t ever received 
In the site 
Then compute the 
average over the 
molecules In the set 

Assign a rank, yy, to 
each molecule based 
on Its best score in 
the docking run 
Then correlate with 
the rank of each molecule 
based on the best score 
It ever received in 
the sit8» Xf 

For each molecule in the 
docking run, compute 
the RMS error of its 
predicted orientation 
compared to that which 
received the best score 
for that site 
Then, compute the 
average over the 
molecules in the set 



N 



i: ix, - xf 

/=1 



N 

N 



lO, 1]; unitlase 



/=1 



N 



[-1, 1]; unrtless 



/-I 



/=1 ' 



lo, »]; in angstroms 
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TABLE V. 

Heceptof StructurB> trom PDB'^'^» Us^d for Twt Syatema. 



Code 



1210 



1ULB 



3DFR 



4FAB 



9HVP 



Structure 



DNA dcxJecamer 
with Netrospin^^ 

Purine nucleoside 
phodphorylase 
wrth guanine*'* 

Dihydrofolate 
reductase wrth 
NADPH and 
methotrexate*^ 

Fab fragment with 
fluorescein*^. 

HIV-1 protease 
wfth A-74704*^ 



Resolution 



2.2 A 



2.75 A 



1.7 A 



2.7 A 



2.8 A 



R factor 



0.198 



0.204 



0.152 



0.215 



0.182 



Srte description 



Site Is broad, presenting two continuous 
binding sites in the major and minor grooves 
of the Dh4A dodecamerr highly polar 

Site has two pockets; one la broad and centrally 
located, another where actuaj ligand binds 
te peripheral and solvent excluded 
Site has a deep, centrally located binding 
pocket mbced polar and nonpoiar regions 

S/te Is shallow wrth three pockets formed by 
the s'Dc hypervariable loops: generally 
nonpoiar 

Site Is a long, narrow tube, which completely 
penetrates protein; mixed polar and 
nonpoiar regions 



pose, we chose to use the reciprocal of the rank for 
convenience. 

To make sure that our conclusions are general- 
izable, we ax\alyzed the methods using five differ- 
ent receptor sites listed in Table V. These sites 
were chjosen from the list of complexes of high 
resolution, well-refined stmctuie with a ligaivd 
having a well-defined binding position. They were 
also chosen based on very different-shaped bind- 
ing sites. The chief features of each site are shown 
in Table V. 



Results and Discussion 

DOCKING CONDITIONS 

The test cases were prepared for docking in the 
standard way. Site paints were constructed using 
the sphere generatictn accessory program of DOCK 
with default parameters," We selected the cluster 
of site points which occupied the binding site of 
the actual ligand in the crystal complex. Within 
this cluster^ we merged the positions of tightly 
grouped site points using a 2-A cutoff. The final 
number of site points used for each receptor ranged 
jfrom 30 to 60. 

RESOURCE USAGE 

All docking calculations were performed on Sil- 
icon Graphics IhdigoZ workstations, equipped with 



200 MHz R4400 processors and 128 MB of RAM, 
so timings are consistent among the different 
methods. Several weeks of computer time were" 
required to complete all runs. AH methods re- 
quired approximately 13 MB of RAM to store the 
scoring grids. The URT arvd URM methods re- 
quired negligible additional memory for matching 
and orientirvg, BRM and SGM required up to 0*1 
MB of RAM for matching arrays. BGM required 1 
MB of RAM for matching arrays. 

TEST S1BTEM8 

Selected results for the 3DFR test system are 
presented in Figures 5, 6, and 7 to illustrate the 
type of data we collected . As shown in Figure 5, 
the weighted average score generally converges 
asymptotically to an optimum as sampling in- 
creases ► The scores from all matching methods 
converge to within 90% of the optimum in about 
10 seconds/molecule, whereas the URT meth- 
od requires about 100 secoruds/miolecule. The 
weigjited rank correlation in Figure 6 also shows 
convergent behavior, but with some interesting 
differences. It goes through much wider, fluctua- 
tions, indicating that small changes in score have 
large effects on the rankings of the top scoring 
molecules. It appears to discriminate among the 
different methods, selecting BRM and SGM as 
superior, BGM and TJRM as next best, and URT as 
worst agaiTL In particular, BRM and SGM both 
show a rapid initial rise, indicating liiat, with very 
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0-1 I 10 100 1000 

Time pcTTDolcculc (Sec) 

FIGURE 5. Weighted average scxire for molecules 
docked to 3DFH using SPHGEN aite points. Each curve 
represents a search algorithm In Table L Each data point 
Is a weighted average of the score for all molecules in a 
particular run using the equation m Table IV. 

little sampling, these methods come closest to pre- 
dicting the rap kings of the top scoring molecules. 
The convergence of weighted RMSD in Figure 7 
indicates hov7 long it takes the different methods 
to predict reprodudbly the same l?inding mode of 
the top scoring molecules. Although the two top 
performing methods comrerge in predicted score 




Tumfi per molecule (Sec) 



FIGURE 6. Weighted rank correlation for molecules 
docked to 3DFR using SPHGEN site points. Each data 
point IS an average of the rank correlation for all 
molecules in a particular run using the equations In 
Table IV. 




Hmc per molecule (Sec) 

FIGURE 7. Weighted average RMSD for molecules 
docked to aDFR using SPHGEN srte points- Only the 
two best search algorithms from Table I are presented. 
Each data point is an average of the RMSD for all 
molecules in a particular run using the equations in 
Table IV. 



and ranking in about 10 seconds^ they require 
about 500 seconds before they consistently predict 
the same binding mode- This result indicates that, 
for these molecules in this sitC/ several good scor- 
ing orientations must exist that are dose in score 
but distant in space. We found gimilar results for 
the oifaer sites as welL 

We found the weighted and imwelghted forms 
of the average score and rank correlatioiv to be the 
most relevant in assessing database screening per- 
formance. This gives us fotar measurements of five 
methods over five sites. Instead of presenting 100 
different curves, we have condensed each curve 
into a single value — the convergence time — which 
represents the time at which the 90% threshold 
value is passed (and not recrossed). In Table VI, 
we present the convergence times along with the 
speed improvement factor of each method com- 
pared to the TJRT method. 

SCORE CONVERGENCE 

With respect to the unweighted average score in 
Table VI(A), all matching methods show roughly 
equivalent convergence- and outperform URT by a 
factor of 10- Therefore, on average, site points 
provide a mtich more succinct desociptiQn of the 
active site than the smallest enclosing box, espe- 
cially when searching a site that is large or difficult 



11S4 



VOL 18, NO, 9 



AUTOMATED MOLECULAR DOCKING AND DATABASE SCREENING 



TABLE VL 



Convergence P 


ropertteii of Seofdi 


iMettioda,* 














Search Methods 






Site 


URT 


URM 


BRM 


SGM 


BGM 


(a) Unweighted Average Score 
121 D 200 (1)c) 
1ULB 200 dx) 
3DFR 100 (1)0 
4FAB 30 (1)0 
9HVP 200 (1)0 


10 (20)0 
10 (20x) 
9 (1Q)0 
7(4)0 
7 (SOx) 


20(1050 

2O(a>0 

10(10x3 

e(4x> 

5 (40x3 


10 (20x3 
20 (10x3 
10 (10k) 
7(5x) 
5 (40x) 


8 (30x:) 
8 (20)0 
20 (€E)c) 
7(4x) 

e (30x) 


Mean 


100(1)0 


9 doO 


10(10)0 


9 (10)0 


9 dOxr) 


(B] Rank-weighted average score 
121D 400 (Ixi 
1ULB 1000(1)0 
3DFR 200 (lx) 
4FAB 90 (1)0 
9HVP 200 (1)0 


10(40)0 
6O(20jO 
20 (9)0 
7(10)0 
7(30x3 


6(BOx3 
30(40x3 
10(20)0 
4(20x3 
2(100)0 


3 (lOQx) 
20 (70x3 
8 (30)0 
4(20)0 
4(40)0 


a (50x> 

30 (40x) 
20 (iQx) 
7(1 Ox) 
5 (40x) 


Mean 


300 (1)0 

* 


20(20)0 


7(40)0 


6 (SOx) 


10 (30x) 


(C) Unweighted rank correlation 
121D 400(1)0 
1ULB 1000(1)0 
3DFR 400 (1)0 
4FAB 200 (1)0 
9HVP 700 (1x) 


20 (20)0 
BO (200 
50 (930 
30 (7iO 
20 (40)0 


10 (30)0 
40(20x) 
20(20)0 
10(20x) 
8(90x3 


7(60x) 
30 (30x3 
10 (3Q?0 
20 (7x3 
10 (60x) 


10 (40x) 
60 (1 Ox) 
40 (lOx) 
50 (3x) 
20 (SOx) 


Mean 


500 (lx) 


30 (lOoO 


20 (30x3 


20 (30)0 


30 (20x) 


(D) Rank-weighted rank cx>rrelat!on 
121D 400(1)0 
1ULB 1000(1)0 
3DFR 600 (1)0 
4FAB 400 (Ik) 
9HVP 300 (1^0 


20 (3Q)0 
60 (20)0 

100 (exD 

10 (30x1 
7 (40)0 


10(40)0 
20(60x3 
10(40x) 
3 (100x3 
2(100x3 


7 (60)0 
20 (70)0 
10 (40k) 
4(90x) 
4 (60x3 


20 (30x) 
80 (20x) 
20 (30x) 
10 (30x) 
3 (80x) 


Mean 


500 (1)0 


20(20)0 


7(70)0 


8 (60)0 


20 (3O30 



* For each fiampling mathod and receptor site, we And iha amourrt of sampling, in seconds, beyond which ail values are wilhin 90% 
□f The maximum. With the time of URT ac the r^ferencs, the reiatw© apeed tector of each method is feporred In parentheses. The 
geometric mean over alt racepfnr ate values is reponed at bottom. Because of the laroe uncertainty in these values, all values are 
rounded to ona signrficant digrt. 



to defma a priori. Because iiie site points were 
generated based oxx general consideratioxis of 
shape, this result shoxild be generalixable to other 
"negative image" tecimlques, like the shape-based 
critical point methods of Nussinov and Wolfsen,^ 
and the energetic probe methods of Goodford.^ 
The rank-weigj\ted average score in Table VI(B) 
shows more discrimination among the matchir\g 
methods. While the two imifoim random methods 
(URT and UEM) had more difficulty converging 
with the top-scoring molec^lleSr BRM and SC3^ 
actually converged more quickly. This implies that 
trying a uniform number of orientations for the 



molecules in the database is inefficient with re- 
spect to processing the top-scoriivg molecules, BGM 
performed better than UEM, but not as well as 
either BRM or SGM. 

RAINK CONVERGENCE 

The convergence of the rank correlation further 
confirms the differences between rcuathods. The 
unweigihted rank correlations in. Table VI(C) again 
show a general 10- to 30-fold advantage of using 
site points to dock the molecules. Interestingly, the 
timfi required to get the rank correct is two- to 
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fivefold greater than the time to get the average 
score [Table VICA)] correct The weighted rank 
correlations ia Table VICD) also further discrimi- 
nate among the metiiods, UKM and BGM still 
show the 20- to 30-fald advantage over URT. BRM 
and SGM again outperform URT by 60- to 70-fbld. 
Of aU measures, this last one is arguably the most 
relevant to database scteenlngr because absolute 
scores are generally not used as strict cutoffs^ but 
instead the rankings are used to select some sub- 
group of molecules whose number is amenable to 
further processing. Most often, it is the top-scoring 
subgroup of most interest, so using a rank- 
weighted correlation shoiild focus our attention on 
how the methods treat this particular subgroup of 
molecules. Therefore, it appears that^ of the meth- 
ods investigated here, BRM and SGM are the best 
stiiied for database screening. 

BGM VERSUS SGM 

The singlle graph matching mefliod dearly out- 
performs the existing bipartite graph matching 
method by up to twofold in speed. This result may 
appear coxmterintuitive^ because SGM is exhaus- 
tive, whereas BGM uses heuristics to speed the 
search. However, the precomputing of the adja- 
cency matrix and the rapid processing of the adja- 
cency lists show that reformulating the problem 
into an efficient form can be just as effective as 
using heuristics. SGM has the additional advan- 
tage of requiring fewer fundamental matching pa- 
rameters than EGM (Table H). 

BRM VERSUS SGM 

Why does a simplistic random matching algo- 
rithm, BRM, perform so competitively with the 
distance matching algorithm/ SGM, and even out- 
perform BGM? I>oe9 this suggest that distance 
matdung is an unnecessarily complicated solution 
to docking? We seek to resolve this question by 
breaking tiie problem into three parts. 

1, What is the disadvantage of sampling each 
molecule taniformly? 

X Why does distance matching sample mole- 
cules rumimiformly? 

3. Is an orientation from random matching just 
as good as one from distance matching? 

First, speruling the same time on each molecule 
may result in spending too little time on the better 



scoring molecules. One feature of the force ft^ld 
scoring function is that it tends to favor larger 
molecules .^^ For the set of molecules \ised in. this 
study, we have plotted the best score for each 
molecule against its size in Figure 8. Although 
there is some trend, the correlation coefficient is 
not large. A stronger trend exists in Figure 9, 
relating molecule size to bump filtering. Large 
molecules have a greater propensity to bump into 
receptor atoms when oriented in the site. Because 
we use a bump filter in DOCK to discard poor 
orientations before the more compulationalLy ex- 
pensive scoring and optimization steps^ we are 
more likely to discard an orientation of a large 
molecule than ihat of a small molecule. Forcing a 
uniform number of matches per molecule would 
then result in a size-biased attrition through the 
bump filter and, overall, spending less time on the 
potentially better scoring, larger molecules. 

Second^ nonxmiform sampling arises in distance 
matching because tine number of matches is re- 
lated to the number of internal distances that the 
Hgand has in common with the site points. Larger 
molecules have more internal distances, and so 
will tend to have more in common. Therefore^ 
larger molecules tend to generate more matches 
ttian smaller molecules. However, a molecule that 
is itot larger, but rather similar in shape to the 
bixuiing site, wiE also have more internal distances 
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FIGURE 8. How molecule size affects force field score. 
The best force field score for each molecule is plorted 
versus the number of nonhydrogen aroms in the 
molacula. The plots for each receptor are pooled into 
this single plot to show the overall trend. Frtting a line to 
These data yields an fl^ value of 0,24. 



nee 



VOL 18. NO. 9 



AUTOMATED MOLECULAfl DOCKING AND DATABASE SCREENING 



o 

"5 



'J3 



1.0 



0.8 



0.6 



0.4 



BKM 




0.6 



0.4 



0.2 



0.0 



• •■ • I m . 



15 



20 



25 



30 



35 



Number of Non-Hydrogen Atoms 



FIGURE 9- Hcrw molecule size affects bump filtering. In 
a given clocking run, we compute the fraction of 
orientations that pass the bump filter for each molecule, 
and then plot this value against the size of the molecule. 
(Top) BRM method. Average fitter rate is 0.031, (BoHom) 
SGM method. Average fitter rate is 0^6. 



in commnn. Therefore, distance matching will 
spend more time on muolecules that are comple- 
mmtary in shape to the receptor. Thus, the BKM 
method benefits from using distance matching to 
detemnine the number of orientations to try simply 
because it will bias its efforts toward the larger 
and/or more complementary molecules- 

Tldrd, an orientation from random matching Is, 
on average, not as good as one from distance 
matching. We can check the quality of these orienr 
talions by examining how they survive tine bump 
filter as depicted in Figure 9. The orientations from 
distance matching are at least ten times more likely 
to make it past this filter, so they are indeed 



superior. To understand why this superiority does 
not translate directly into faster docking, we must 
consider the computational bottleneck of the cur- 
rent implementation of DOCK. If the eatly docking 
steps — ^matching, orienting, and bump filtering- 
were to consume a dominant portion of the total 
cpu time, then SGM would be up to ten tiroes 
faster than BRM. On the other hand, if the later 
docking steps — scoring and optimization — were 
to consume a dominant portion of the ix>tal cpu 
time, then BRM would be equivalent in speed to 
SGM. With the current implementation of the min- 
imizer, the scoring and optimization steps indeed 
consume an overwhelming portion of cpu time 
(> 90%), so tiie ability of distance matching to 
form high-quality orientations is for now tmre- 
war ded. BRM is kept competitive because, in con- 
cert with heavy pnming by the bump filter, it 
forms orientatians that are suitable starting points 
for optimization. 

BRM VERSUS BGM 

The fact that BRM outperforms BGM by up to a 
factor of two points out the critical importance of 
how processing time is allocated amxmg the differ- 
ent molecules in tiie database. Por maximum, effi- 
ciency, a search algorithm must bias its efforts 
toward the mi^st highly tanked molecules, BGM 
indeed has such a bias built in, but the heuristics 
have the effect of reducing the magnitude of bias 
inherent to the unrestrained matching found in 
SGM and BRM. 

FUTURE DIRECTIOIVS 

It appears sensible to incorporate the single 
graph matching technique into version 4.0 of 
DOQs because its results are of high quality and 
its potential for speed improvement is high. The 
speed gains might arise hrom additional orienta- 
tion filtering or by fundamental improvements in 
the optimization technique. The new matching al- 
gorithm will provide a solid algorithmic foxxnda- 
tion on which to base further development of 
molecular docking, including the addition of so- 
phisticated search constraints like chemical labels 
and critical dusters, as well as the explicit treat- 
ment of ligand flexibility. 

Until a faster scoring and optimization method 
is implemented, it may be useful to preprocess the 
set of orientations generated by matching- For in- 
stance, an implementation of degeneracy checking 
has been tried in which similar orientations are 
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removed at the matching gtage-^ Because orient- 
ing is also relatively facile, it Implifis that degener- 
acy checking coidd instead be performed as an 
KMS deviation calculation between real orienta- 
tions. Earey et al. present a rapid KMSD evalua- 
tion technique based directly on the rotation 
matrices^'' that would he applicable. It might also 
be possible identify a unique property of orienta- 
tions that lead to the best scores upon minimiza- 
tion (e.g., degree of surface overlap ivith receptor). 
We intend to further study tihe nature of orienta- 
tions generated by BRM that pass the bump filter, 
to see why they manage to scoie so well upon 
mlnimizatiorL Based upon this knowledge, a f0.ter 
coxild be constructed ihat enriches the set of orien- 
tations generated by regular matching prior to 
minimization^ 

An additional avenue for improvement would 
be to try alternative methods to generate site 
points. The current shape-based site points may 
not be entirely consistent with a force-field-baaed 
scoring function-^^ Including force-field considera- 
tions during site point construction, might make 
the site points more relevant as points on whidi to 
position ligand atoms^ thereby increasing the qual- 
ity of docked orientations. 

We are working to extend the current docking 
protocol to include ligand and receptor flexibility, 
Ligand ilexibility can be incorporated in several 
ways, H\e most straigjitforward approach is to 
dock multiple conformations of the ligand sepa- 
rately-^^ A potentially more effidaat method is to 
use distance geometry to build a ligand conforma- 
tion that fits a subset of receptor site points.^^ 
Another viable option is the "divide and conquer'' 
strategy, in which a flexible molecule is broken 
into rigid fragments, the fragments are docked 
independentiy, and the molecule is rebuilt from 
adjacent fragment orientations-^^"^^ Limited recep- 
tor flexibility is being investigated during the step 
of score evaluation by superimposing multiple re- 
ceptor conf ormers on a single score potential grid.^ 



Conclusion 

We evaluated various search algorithms for au- 
tomated molecular docking that range in complex- 
ity from purely random to site point-directed^ 
graph-theoretical matching methods. Our basis of 
comparison was how quickly each docking method 
could correctiy score and rank a database of 
molecules. Over a broad range of active site envi- 



ronments, it is at least tenfold more efficient to use 
a collection of site points to describe the active site 
search volume than to use the smallest enclosing 
box. Using graph theoretical matching techniques 
boosts this relative efficiency higher. The bipartite 
graph matching used in the current DOCK version 
3-5 improves efficiency up to 30-fold. The single 
graph matching proposed for DOCK version 4.0 
improves efficiency up to 60-fold. Because single 
graph matching is not only faster, but also less 
complicated than the bipartite graph matching 
used in version 3.5, we feel it will be an important 
advance in docking technology. 
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ABSTRACT: A novel and robust automated docking method that predicts the 
bound conformations of flexible lijgands to macxomolecnlar targets iias been 
developed and tested, in combination with a new scoring function that estimates 
the free energy change upon binding. Interestingly; tids method applies a 
Latnarckian model of genetics^ in which environmental adaptations of an 
individual's phenirtype are reverse trai^scribed into its genotype and become 
heritable traits (sic). We consider three search methods, Monte Carlo simulated 
annealing, a traditional genetic algorithm, and the Lamarckian genetic algorithm, 
and compare their performance in dockings of seven protein-ligand test systems 
having known three-dimensional structure. We show that both the traditional 
and Lamarckian genetic algorithms can handle Hgands with more degrees of 
freedom than the simidated annealing method used in earlier versions of 
AUTODOQ^ and that the Lamarckian genetic algorithm is the most efficient 
reliable, and successful of the three. The empirical free ei\er§y function was 
calibrated using a set of 30 aixucturally known protein-ligand complexes with 
experimentally determined binding constants. Linear regression analysis of the 
observed binding constants in terms of a wide variety of structure-derived 
molecular properties was performed. The final model had a residual standard 
error of 911 kj mol"^ (2.177 kcal mol"^) and was chosen as the new energy 
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fimctioiu The new search methods and empirical free enerev funrHrvn 



Introduction 

A fast atom-based compuiatinnal docking tool 
is essential to most techniques for structure- 
based drug design,^' ^ Reported techniques for au- 
tomated docking fall into two turoad categories: 
matching methods and docking simulation meth- 
ods.^ Matching methods create a model of the 
active site, typicaUy including sites of hydrogen 
banding and sites that are sterically accessible, and 
then attempt to dock a given inhibitor structure 
mto the model as a rigid body by matching its 
geometry to that of the active site. The most suc- 
cessful example of ttiis approach is DocK,^'^ which 
is efficient enough to screen entire chemical 
databases rapidly for lead compounds. The second 
class of docking techniques model the docking of a 
ligand to a target in greater detail: the ligand 
begins randomly outside the protein, aiuj explores 
translations, orientations/ axui confoxmatLons until 
an ideal site is found. These techniques are typi- 
cally slower than the matching techniques, but 
the^ allow flexibility within the ligand to be mod- 
eled and can utilize more detailed molecular me- 
chanics to calculate the energy of the ligand in the 
context of the putative active site. They allow 
computational chemists to investigate modifica- 
tions of lead molecules suggested by the chemi- 
cal intuition and expertise of organic synthetic 
chemists. 

AtjtoDcoc^'^ is an example of the latter, more 
physically detailed, flexible docking technique. 
Previous releases of AutdDock combine a rapid 
grid-based method for energy evaluation,®-^ pre- 
calculating ligand-protein pairwise interaction en- 
ergies so that they maybe ixsed as a look-up table 
during simulation, with a Monte Carlo simulated 
annealing search"- for optimal conformations of 
Hgands. AxnX)DoCK has been applied with great 
success in the prediction of boimd canfotmations ' 
of enzyme-inhibttor complexes,^ peptide-anti- 
body complexes,^* and even protein-protein inter- 
actions^; these and other applications have been 
reviewed elsewhere.^ 



We mitiated the current work to remedy two 
limitations of A-utoDock. (i) We have found that 
the simulated annealing search method performs 
well with ligands that have roughly eight roiatable 
bonds or less: problems with more degrees of 
freedom, rapidly become intractable. This de- 
manded a more efficient search method, (ii) 
AutoDock: is often used to obtain unbiased dock- 
ings of flexible inhibitors in enzyme active sites- in 
computer-assisted drug-design, novel modifica- 
tions of such lead molecules can be investigated 
computationally. Like many other computational 
approaches, AuroDocK performs weU in predict^ 
ing relative quantities and rankings for series of 
similar molecules; however, it has not been possi- 
ble to estimate in AutdDock whetiier a ligand will 
bind with a millimolar, micromolar, or nanomolar 
binding constant. Earlier versions of AxtoDock 
used a set of traditional molecular mechanics 
foroe-iield parameters that were not directiy corre- 
lated with observed binding free energies; hence, 
we needed to develop a force field that could be 
used to predict such quantities. 

Molecular docking is a dLfficult optimization 
problem, requiring efficient sampling across Ihe 
entire range of positional, orientational, and con- 
formational possibilities. Genetic algorithms (GA) 
fulfilt the role of global search particularly well, 
and are increasingly being applied to problems 
that suffer from combinatorial explosions due to 
ttieir many degrees of freedom. Both canonical 
genetic algorithms^''"^^ and evolutionary program- 
ming methods^ have been shown to be successful 
in both drug design and docking. 

In this report, we describe two major advances 
that are included in the new release of AtjtoDco:, 
version 3.0. The first is the addition of three new 
search methods; a genetic algorithm; a local search 
method; and a novel, adaptive global-local search 
method based on Lamarckian genetics, the La- 
marckian genetic algorithm (LGA). The second ad- 
vance is an empirical binding free energy force 
field that allows the prediction of binding free 
energies, and hence binding constants, for docked 
ligands. 
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GENETIC ALGORITHMS 

Gm£tk olgoritkmB^ use ideas based on the lan- 
guage of natural genetics and biological evolu- 
tion?* In the case of molecular docking, the partic- 
ular arrangemmt of a ligand and a protein can be 
defined by a set of values describirig the transla- 
tion, orieniation, and conformation of the ligand 
with respect to the protein: these are the ligand^ 
state variables and, in the GA, each siate variable 
corresponds to a gene. The ligand's state corre- 
sponds to the genotype^ whereas its atomic coordi- 
xmtes correspond to the phenoiype In. molecular 
docking, the fitness is the total interactton energy 
of the ligand with the protein, and is evaluated 
using the energy function. Random pairs of indi- 
viduals are mated using a process of crossover, in 
which new individuals inherit genes from either 
parent. In addition, soma offspring undergo ran^ 
dom mutation, in which one gene changes by a 
random amount Sdsdion of the offspring of the 
current generation occurs based on the individual's 
fitness: ^u$, solutions better suited to their envi- 
ronment reproduce^ whereas poorer suited ones 
die, 

A variety of approaches have been adopted to 
improve the efficiency of the gmetic algorithm- 
Classical genetic algorithms represent the genome 
as a fixed-length bit string, and employ binary 
crossover and binary mutation to generate new 
individuals in the population. Unfortunately, in 
Eoany problems^ sudi binary operators can gener- 
ate values that are often outside the domain of 
interest, leading to gross inefficiencies in the search. 
The use of real encodings helps to limit the genetic 
algorithm to reasonable domains. Alternative ge- 
netic algorithms have been reported^ that employ 
more complicated representations and more so- 
phisticated operators besides crossover and muta- 
tion. Some of these retain the binary represen- 
tation^ but must employ decoders and repair 
algorithms to avoid building illegal individuals 
from the chramosome, and these are frequently 
compuiatianally intensive. However, the search 
performance of the genetic algorithm can be im- 
proved by introducing a local search method,^' ^ 

HYBRID SEARCH METHODS IN AUTODOCK 

EafHer versions of AUTODocK used optimized 
variants of simulated annealing.^' ^ Simulated an- 
nealing maybe viewed as having both global and 
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local search aspects, performing a more global 
search early in the nin, when higher temperatures 
allow transitions over energy barriers separating 
energetic valleys^ and latex on performing a more 
local search when lower temperatures place more 
focus on local optimization in the current valley. 
AuTODOCK 3>0 retains the functionality of earlier 
versions, but adds the optioxis of using a genetic 
algorithm (GA} for global searching, a local search 
(JJS) method to perform energy ininirnizatlon, or a 
combination of both, and builds on the work of 
Belew and Hart'^'^^ The local search method is 
based on that of Solis and Wets/^ which has the 
advantage that it does not require gradient inform 
mation about the local energy landscape, thus fa- 
cilitating torsional space search. In addition, the 
lcH:al search method is adaptive, in that it adjusts 
the step size depending upon the recent history of 
energies: a user-defined number of consecutive 
failures, or increases in energy, cause the step size 
to be doubled; conversely, a user-defined number 
of consecutive successes, or decreases in energy/ 
cause the step size to be halved. The hybrid of the 
GA method with the adaptive LS method together 
form the so-called Larmirckian genetic algorithm 
CLGA), which has enhanced performance relative 
to simulated annealing and GA alone,^^'^* and is 
described in detail later. Thus, the addition of 
these new GA-based docking methods enhances 
AiiroDocK/ and allows problems with more de- 
grees of freedom to be tackled. Furthermore, it is 
ruDw possible to use the same force field as is used 
in docking to perform energy minirnization of 

ligands. 

IMPLEMENTATION 

In our implementation of the genetic algorithniy 
the chromosome is composed of a string of real- 
valued gettes: three Cartesian coordinates for the 
ligand translation; four variables defixung a 
quaternion specifying the ligand orientation; and 
one real-value for each ligand torsion, in that or- 
der* QuaiemionLS are used to define the orienta- 
tion^^ of the ligand, to avoid the gimbal lock 
problem experienced with Etiler angles.^^ The or- 
der of the genes that encode the torsion angles is 
defined by the torsion tree created by AuroTORS, a 
preparatory program used to select rotatable bonds 
in the ligand. Thus, there is a one-to-one mapping 
from the ligand's state variables to the genes oi the 
individual's chromosome. 

The genetic algorithm begins by creating a rarv- 
dom population of individuals, where the user 
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defines the number of individuals in the popula- 
iioTL For each random individual in ihe initial 
population^ each of the three translation, genes for 
X, y, and z is given a uniformly distributed rart- 
dom value between ihe minimum and maximum 
X, y, and z extents of the grid maps, respectively; 
the four genes defining the orientation are given a 
random quatemiorv consisting of a random tmit 
vector and a random rotation angle between 18(r 
and + 180°; and the torsion angle genes, if any, are 
given random values between — 180° and + 180''- 
Furthermore, a new random nimiber generator has 
been introduced that is hardware-independent^ It 
is used in the LS, CA, and LGA search engines, 
and allows results to be reproduced on any hard- 
ware platform given the same seed values. The 
creation of the random initial population is fol- 
lowed by a loop over generations, repeating until 
the maximum number of generatians or the maxi- 
mum nimiber of energy evaluations is reached, 
whichever comes first. A generation consists of 
five stages: mapping and fitness evaluatiart selec- 
tion, crossover, mutation, and elitist selection^ in 
that order. In the Lamarckian GA, each generation 
is followed by local search, being performed on a 
user-defined proportion of the population. Each of 
these stages is discussed in more detail in what 
follows. 

Mapping translates from each individual's geno- 
type to its corresponding phenotype, and occurs 
over the entire population. This allows each indi- 
vidual's fitness to be evaluated- This is the sum of 
the intermolecular interaction energy between the 
ligand and the protein^ and the intramolecular 
interaction energy of the ligand. The physicochemr 
ical nature of the energy evaluation function is 
described in detail later. Every timje an individual's 
energy is calculated, eitlier during global or local 
search^ a coimt of the total number of energy 
evaluations is irucremented. 

This is followed, in our implementation, by pro- 
portional selection to decide which individuals wiU 
reproduce. Thus, individuals that have better- 
than-average fitness receive proportionally more 
offspring, in accordance with: 



</> 



/» ^ </> 



where is the integer number of offspring to be 
allocated to the individual; is the fitness of the 
individual Q.e., the energy of the ligand); is the 
fitness of the worst individual, or highest energy. 



in the last N generations (i.e.y is a user-<lefina- 
ble parameter, typically 10); and </> is the mean 
fitness of tiie population. Because the worst fitness, 
/n,, will always be larger than either /, or </), 
except when fi « then for individuals that have 
a fitness lower than the mean, < </>, the nu- 
merator in this eqxiation, ^ fu always be 
greater than the denominator — ^/)^ ^d thus 
such individuals will be allocated at least one. 
offspring, aiui thxis will be able to reproduce. 
AUTODOCK checks for — {/) beforehand, and if 
true, the population is assumed to have con- 
verged, and. the docking is terminated. 

Crossover and mutation are performed on ran- 
dom members of the population according to 
user-defined rates of crossover and mutation. First, 
crossover is performed. Two-point crossover is 
used^ with breaks occurring oidy between genes, 
never within a gene — this prevents erratic changes 
in the real values of the genes. Thus, both parents' 
chro3Diosomes would be broken into three pieces at 
the same gene positions, each piece containing one 
or roore genes; for instance, . ABC and ahc. The 
chromosomes of the resulting offspring after two-, 
point crossover would be AhC and aBc^ These 
offepring replace the parents in the population^ 
keeping the population size constant. Crossover is 
followed by mutation; because the translational, 
orientational, and torsional genes are represented 
by real variables, the classical bit-flip mutation 
would be inappropriate- Instead, mutation is per- 
formed by adding a random real number that has 
a Cauchy distribution to the variable, the distribu- 
tion being given by: 



C(a,j3, x) - 



7r(^^+(x-a)) 

a 5i 0, p > 0, —CD < X < °o 



where at and /3 are parameters that affect the 
mean and spread of the distribution. The Cauchy 
distribution has a bias toward small deviates, but, 
vinlike the Gaussian distributioiVr it has thick tails 
that enable it to generate large changes occasion^ 

An optional user-defined integer parameter 
elitism determines how many of the top individu- 
als automatically survive kito the next generation. 
If the elitism parameter is non-zero, the new popu- 
lation that lias resulted from the proportional se- 
lection, crossover, and mutation is sorted accord- 
ing to its fitness; the fitness of new individuals 
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iiaving resulted from crossover and/or mutation 
is calculated as necessary. Because populations are 
implemented as heaps, selection of the best n 
individuals is efficient. 

The genetic algorithm iterates over generations 
until one of the tenninatian criteria is met At the 
end of each dcxrking, AxjtdDock: reports the fitness 
(the docked energy), the state variables, and ihe 
coordinates of the docked conformation/ and also 
the estimated free energy of binding. AutoDock 
performs ihe user-^cdfied number of GA dock- 
ings, and then carries out conformational clxaster 
analysis on the docked conformations to determine 
which are similar, reporting the clusters ranked by 
increasing energy. 

LAMARCKL\N G£NETIC ALGORTIHIVf 

The vast majority of genetic algorithms mimic 
the major characteristics of Darwinian evolution 
and apply Mendelian genetics. This is illustrated 
on ihe right-'hand side of Figure 1 (note the one- 
way transfer of information from the genotype to 
the phenotype)» However, in those cases where an 
inverse mapping function exists (i.e., one which 
yields a genotyf^ from a given phenotype), it is 
possible to firdsh a local search by replacing the 
individual with tbie result of the local search; see 
the left-hand side of Figure 1. This is called the 
t-axnarckian genetic algorithm (LGA), and is an 
allusion to Jean Batiste de Lamarck's (discredited) 
assertion that phenotypic characteristics acquired 
during an individual's life time can become herita- 
ble traits.^^ 

The most important issues arising in hybrids of 
local search (LS) techniques with the GA revolve 
around the developmental mapping, which trans- 
forms genotypic representations into phenotypic 
ones.^ The genotypic space is defined in terms of 
the genetic operators — mutation and crossover in 
oiir experiments— by which parents of one genera- 
tion are perturbed to form their children. The phe- 
notypic space is defined directly by the problem* 
nainely, the energy function being, optimized. The 
local search operator is a useful extension of GA 
global optimization when there are local "smooth- 
ness'^ characteristics (continuity, correlatioiv etc>) 
of the fitness function that local search can exploit, 
ia hybrid G A + LS optimizations, the result of ttie 
LS is always used to update the fitness associated 
with an individual in the GA selection algorithm- 
If, and only if, the developmental mapping func- 
tion is invertihle, will the Lamarddan option — 
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FIGURE 1- This figure illustrates genotypic and 
phenotypic search, and contrasts Darwinian and 
Lamarckian search.^^ The space of the genotypes is 
represented by the lower horizontal line, and the space 
of the phenotypes is represented by the upper horizontaj 
line. Genotypes are mapped to phenotypes by a 
developmental mapping function. The fitnesis function Is 
fix). The result of applying the genotypic mutation 
operator to the parent's gonotype Is shown on the 
rigtrt-hand side of the diagram, and has the 
corresponding phenotype shown. Local search is shown 
on the left-hand side, ft is normally performed in 
phenotypic space and employs infonngtion about the 
-fitness landscape. Sufficient "rterations of the local search 
arrr/e at a local minimum, and an inverse mapping 
function is used to convert from its phenotype to its 
corresponding genotype. In the case of molecular 
docking, however local search is performed by 
contihuoualy convening from the genotype to the 
phenotype, so ir\verae mapping Is not required. The 
genotype of the parent is replaced by the resulting 
genotype, however, in accordance with Lamarckian 
prlnaples. 



converting the phenotypic result of LS back into its 
corresponding genotype become possible. 

In our case, the fitness or energy is calculated 
from the ligand^s coordinates/ which together form 
its phenotype. The genotypic representa.tion of the 
ligand, and its mutation and crossover operators, 
have already been described- The developmental 
mapping simply transforms a molectxle^'s geno- 
typic State variables into the corresponding set of 
atomic coordinates. A novel feature of this applica- 
tion of hybrid global-local opttmizatLon is that the 
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Solis and Wets US operator searches through the 
genotypic space rather than the more typical pher 
notypic space. This means that the developmental 
mapping does not need to be inverted. Nonethe- 
less^ this molecular variation of the genetic algo- 
rithm still quallEes as Lamarckian, because any 
''environmental adaptations'' of the ligand ac- 
quired dnrrng the Itical search will be iiiherLted by 
its offspring. 

At each generation, it is possible to let a user- 
defined fraction of the population undergo such a 
local seardu We have found improved efficiency 
of docking with local search frequencies of just 
0.06, although a frequency of 1.00 is not signifi- 
cantly more efficient.^ Both the canonical and a 
slighdy modified version of the Solis and Wets 
method have been implemented- In canonical Solis 
and Wets, the same step size would be used for 
every gene^ but we ha^ improved the local search 
efficiency by allowing the step size to be differ- 
ent for each type of gene: a change of 1 A (1 A = 
10"^° rcO in a translation gene cotild be much 
more significant than a dhange of 1** in a rotational 
or torsional gene. Ici the docking experiments pre- 
sented here, the translational step size was OJZ A, 
and the orientational and torsional step sizes 

were 5"*- 

In the Lamarckian genetic algorithm, genotypic 
mutation plays a somewhat different role than it 
does in traditional genetic algorithms. Traditionr 
ally, mutation plays the role of a local search 
operator, allowing smalls refining moves that are 
not efficiently made by crossover ai\d selection 
alone. With the explicit local search operator, how- 
ever, this role becomes unnecessary, and is needed 
cmly for its role in replacing alleles that might have 
disappeared through selection. In LGA^ mutation 
can take on a more exploratory role. The Catichy 
deviates are a compromise between radical jumps 
to arbitrary sections of the conformation space and 
detailed exploration of the local topography. 

DERIVATION OF THE EMPIRICAL BINDING 
FREE ENERGY FUNCTION 

The study of molecular structure underpins 
mucii of computaiicmal molecular biology. There 
are several established methods for performing 
molecular mechanics and molecular dynamics^ no- 
tably AMHER,^'3s Chasmm/^ Dbcover,^^ Ecepp,^ 
and GROMOS,^ Many of ttiese traditional force 
fields model the interaction energy of a molecular 
system with terms for dispersion/repulsion,^ hy- 
drogen bonding,*^ electrostatics,^'*^ and devia- 



tion from ideal bond lengths and bond angles . 
These methods are excellent for studying molecu- 
lar processes over time, for optimizing bound con- 
formations, and for performing free energy per- 
turbation calculations between molecules wifri a 
sin^ atom change,^ but they often require con- 
siderable investments of computer time and, un- 
fortunately, these approaches tend to perform less 
well in ranking the binding free energies of com- 
pounds that differ by more than a few atoms. 
What is needed is an empirical relationship 
between molecular- structure and binding free 
energy. 

The first thoroughly established linear free en- 
ergy relationship was observed by Hammett as 
early as 1933^ and reported in 1937,^ It was used 
to relate structure and reactivity of small organic 
molecules on a quantitative basis. Hammett was 
able to. derive substituent constants and reaction 
constants that could then be used to calculate rate 
constants and equilibrium constants for a specific 
reaction of a specific compound. It could be said 
that Hammett's work was the forerunner of mod- 
em-day quantitative structure-activity relation-, 
ships (QSAR), pioneered by Hansch and cowork- 
ers la the 1960s, Here it is assumed that the sum of 
the steric, electronic, exid hydrophobic effects of 
substituents in a compoimd determines its biologi- 
cal activity; sec, for example, Fujita,** Hansch,^ 
and more recentiy Selassie et al,^ 

Current structure-based scoring functiorvs seek 
to remedy some of the deficiencies of traditiDnal 
force fields by developing empirical free energy 
functions that reproduce observed binding con- 
stants. Most of these approaches use an expanded 
"master equation" to model the free energy of 
bindings adding entropic terms to the molecular 
mechanics equations^^: 

+ AG^, + AG,,, 

where the first four terms are the typical molecular 
medianics terms for dispersion/repulsion, hydro- 
gen bonding, electrostatics, and deviations from 
covalent geometry, respectively; AG^^^ models the 
restriction of internal rotors and global rotation 
and translation; and AG^i models desolvation 
upon binding and the hydrophobic effect (solvent 
entropy dianges at solute-solvent interfaces). This 
latter term is the most challenging. Most workers 
use variants of the method of Wesson and Eisen- 
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berg,^ calculating a desolvation energy based on 
the surface area buried upon complex formation, 
with, the area of each buried atom being weighted 
hy an atomic solvation parameter. Bohm built on 
eatHer work with the de novo inhibitor design 
program Lttoe/^ and used linear regresaian to cali- 
brate a similar function against a set of 45 diverse 
proteitv-ligand complexes with published binding 
constants- The final functicfl^ predicted binding 
constants for a set of test complexes with a stan- 
dard deviation equivalent to about a factor of 25 in 
binding ccmstanti more than sufficient to rank in- 
hibitors with millimolax, micromolar, and nanomo- 
lar binding constants. Jain devised a continuous. 



diffetentiable scoring functiorv which is, in 
essence, very similar to that of Bohm/ but based on 
non-physical pairwise potentials using Gausslans 
and sigmoidai terms. 

We have implemented a similar approach using 
the thermodynamic cycle of Wesson and Eisen- 
berg.^ The function includes five terms: 



AG - AG. 



12 



A) 



+ AG^iec L 



(1) 



where the five AG terms on the zight-hand side 
are coefficients empirically determined using hn- 
ear regression analysis from a set of protein-ligand 
complexes witii known binding constants^ shown 
in. Table I. The summations are performed over all 
pairs of ligand atoms, ij and protein atoms, /, in 
addition to all pairs of atoms in the ligand that are 
separated by three or more bonds. 

The in vacuo contributions include three interac- 
tion energy terms, used in previous versions of 
AUToDcoc: a Lennard-Jones 12-6 dispersioFn/re- 
pulsion term; a directional 12-10 hydrogen bond- 
ing term, where Bit) is a directional weight based 
on the angle, t, between, the probe and the target 
atom'; and a screened Coulombic electrostatic po- 
tential.^^ Each of these terms, including their pa- 
rameterization, have already been desmbed/ 

A measure of the imf avorable entropy of ligand 
binding due to the restriction of coriforrriatianal 



degrees of freedom is added to the in vacuo func- 
tion. Ihis term is proportional to the number of 
sp^ bonds in the ligand, Nj^^..^ We investigated 
varianls that included and excluded m^thyL hy- 
droxy!, and amine rotors. 

In the development of an empirical free energy 
function for AUTODocK, the desolvation tarm was 
most challenging:, because AXJTor)ciCK: uses a grid- 
based method for energy evaluatioiu and most 
published solvation methods are based on surface 
area calculations. We investigated two different 
methods of calculating the desdvatiuon energy 
term. The first of these methods was based on 
estimating atom-by-atom contributians to the ia- 
terf adal molecular surface area between the ligand 
and the protein using the difference in the surface 
areas of the complex and the unbovmd protein and 
unbound ligand. Both the solvent-accessible and 
solvent-excluded surface areas were considered, 
being calculated with MSMS/^ a fast and reliable 
program that computes analytical molecular sur- 
faces. Unfortunately, there can be significant errors 
in the value of the interfacial solvent-accessible 
sttfface areas, due to the '^collar^' of accessible- 
surface that surrounds the ligand— protein interface 
in the complex. We also tested seven variants of 
the pairwise, volume-based method of Stouten 
et al.^: this method has the advantage that it is 
consistent with the pre-calculated affinity grid for- 
mulation used by AOTODocK. For each atom in the 
ligand/ fragmental volumes of surrounding protein 
atoms are weighted by an exponential function 
and then summed, evaluating the percentage of 
volume around the ligand atom that is occupied 
by protein atoms. This percentage is then weighted 
by the atomic solvation parameter of the ligand 
atom to give the desolvation energy- The full 
method may be broken into four separate compo- 
nents: bxtdal of apolar atoms in the ligand, btuial 
of apolar protein atoms, burial of polar and charged 
atoms in the ligand, and burial of polar and 
charged protein, atoms. Great success has also been 
reported in using simply the amount of hydropho- 
bic surface area buried upon complexation as a 
measure of the '"hydrophobic effect/'^ so we 
tested several formulations that included only the 
volume lost around ligand carbon atoms. The 
burial of polar atoms caused particular problems, 
as discussed in what foILows. Apart from the vol- 
ume-based method, we tested a simpler formula- 
tion for the solvent transfer of polar atoms; that is, 
a constant term corresporvding to the favorable free 
energy of interaction of a polar atom with solvent 
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TABLE I. . 

Prote&n-Ugand Comptwes Uaed to Callbreto Empirical Free Enorgy Funcuon, Along wfth Brookhaven Protein 
Dsta Bank (PDB) AcceMfon Codes and Binding. 
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HIV-1 Protease /XK263 


1hvr 


9.51 


Farty-acid-binding protein / C^gCOOH 


2Ifb 


5-43 


Myoglobin (ferric^ / Imidazole 


1mbi 


1.88 


MCPG603 / phosphochollne 


2mcp 


5.23 


^Trypsin / benzamldlne 


aptb 


4.74 \ 


Retino^blndlng protem y retinol 


lrbp 




Thermolysin / Lou-hydroxylaminB 


4tln 


3.72 


Thermolysin / phosphoramidon 


Itip 


7.55 


Thermolysin / n-(l -carboxy-3-pheny IpropyO-Ueu-Trp 


Itmn 


7.30 


Tliermolysin/Cbz-Ph^^Leu-Aia (ZFpLA) 


4tmn 


10,19 


Thermolysin /Cbz-Gly-p-Leti-Leu (ZGpLL) 


5tmn 


B.04 


Purine nucleoside phosphorylase (PNP) / guanine 


1ulb 


5.30 


Xylose isomerase/CB37i7 


2xis 


5.82 


Triose phosphate isomerase (TIM) / a-phosphoglycol'ic add {PQAJ 


2yp! 


4.82 



* Adapiad from aShm. 



is estimated^ and. this is subtracted from the bind- 
ing free energy^ 

Trilirvear interpalatian is used to evaluate 
rapidly the intermolecular dispeision/repulsion 
energy^ the hydrogen banding energy, the electro- 
static potential, and the solvation energy of each 
atom in the Hgand, using grid maps that have been 
pre-calcuiated over the protein for each atom type 
in the ligand. In AxJTODocK 3.0, we have imple- 
mented a faster method of trilineaj interpolation^^ 
than was available in earHer versions of AlttoDcck. 
Both nuethods are mathematically equivalent The 
origmal implementation used 24 multiplications to 
perform each three-dimensional trilinfiar inteipola- 
tion/ but, by cascading seven one-dimensional in- 
terpolations, the number of multiplications has 
been reduced to 7. 



Thirty protein-Jigand complexes with pub- 
lished binding constants were used m. the calibra- 
tion of AxjtoDock's free energy function (Table I), 
and were chosen from the set of 45 tised by Bohm,^ 
omitting all complexes that he mx>deled (i.e., using 
only complexes for which crystallographic struc- 
tures were available). One of the limitations of 
these iDinding constant data is that the conditions 
under which they were determined vary, which 
intrinsically limits the accuracy of our best model- 
We converted between the inhibition constant, JC 
and the observed free energy change of binding, 
AG^^j using the equation: 

where R is the gas constant, 1.987 cal mol"^ 
and r is the absolute temperature, assimied to be 
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room temperature, 298.15 K. Note that tius equa- 
tion lacks a mintis sign because the inhibitian. 
constant is defined for ihje dissociaticm reaction, 
EI ^ E + I,^^ whereas AG^^^ refers to the opposite 
process of binding, E + I ^ EI; where E is the 
enzyme and I is the inhibitor. 

To remove any steric clashes in the crystalla- 
graphic complexes^ each ligand was optimized us- 
ing AuxoDcxk'b new Solis and Wets local mlni- 
mi^ation technique described earlier^ but wiiii 
the previously reported force field/ The separate 
conteibuiions from ttie hydrogen bQndlng,r disper- 
sion/repulfiioriy electros tatic^ and solvation ener- 
gies were evaluated. Empirical free energy coeffi- 
cients for each of these terms were derived using 
hnear regression in the S-PLUS software package,^ 
and cross-validation studies were performed. In 
total, 900 different binding free energy models 
'vvere tested; each linear model consisted of a van 
der Waals term, a hydrogen bonding t erm (one of 
6 variants), an electrostatic term, a torsional en- 
tropy term (one of 5 variants)^ and a desolvation 
term (one of 15 variants). We also investigated 
wheiher the inclusion of a constant term improved 
the modeL Sbc of the seven test systems xised to 
test the docking procedure, which were originally 
used to test AuioDCOc^ version 2A/ were also in 
the training set of 30 protein-ligand compkxes; 
therefore^ to validate the chosen coefficients, linear 
regression was repeated for the set of 24 
proteinr-iigand complexes, excluding the 6 over- 
lapping test eys terns. 

TESTING DOCKING METHODS 

Seven protein-ligand complexes^ with a riange 
of complexity and chemical properties, were cho- 
sen from the Brookhaven Protein Data Bank^' ^ to 
compare the performance of the docking tech- 
niques (see Fig. 2). To facilitate comparison with 
ihe previous force fields we chose the same set of 
six test systems investigated earlier/ but added a 
harder docking problem to challenge all the search 
methods (see Table iD. The simplest test cases 
were the ^trypsin/benzamidine and cytochrome 
P-450^^/camphot complexes, which had smalL 
rigid ligands. Interactions In the former are domi- 
nated by electrostatic interactions and hydrogen 
bonds to ihe substrate amidine, whereas the latter 
is dominated by hydrophobic interactions. McPC- 
603 /phosphocholine and streptavidin/biotin were 
moderately flexible, and represented test systems 
having an intermediate level of difficulty. HIV-1 
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protease/XK263, hemagglutLtun/sialic add, and 
dihydrofolate reductase /methotrexate provided 
more diffixrult tests, with maxiy rotatabla bonds 
and diverse chemical characteristics. 

We compared the performance of Monte Carlo 
sicnulated annealiiig (SA), the genetic algorithm 
(GA)^ and the Lamarckian genetic algorithm (LGA). 
The new empirical free energy fimction presented 
here was used for energy evaluation in all cases. 
Dockings were performed using approximately the 
same number of energy evaluations (— 1.5 mil- 
lion)/ so each method could be judged given simi- 
lar computational investments- The CPU time 
taken for a single docking varied from 4-5 to 41.3 
minutes, on a 2OO-MH2 Silicon Graphics MIPS 
4400 with 128 MB of RAM, depending on the 
number of rotatable bonds and ihe number of 
atoms in the ligand. 

At the end of a set of dockings, ihe docked 
conformations were exhaustively compared to one 
another to determine similarities, and were clus- 
tered accordingly. The user-defined root-mean- 
square positional deviation (rmsd) tolerance was 
used to determine if two docked conformatiQiis 
were similar enou.gh to be included in the same" 
cluster, and symmetrically related atoms in the 
Hgand were considered- These clusters were ranked 
ia order of incceasing energy^ by the lowest energy 
in each cluster. Ordinarily, the structure of the 
protein-ligand complex would not be known, so 
the criteria by which the dockings would be evalu- 
ated are the energies of the docked structures, and, 
in cases where there are several plausible, low-en- 
ergy structures, the number of conformations in a 
conformationally simHax cluster- Because one of 
our goals was to test the ability of the methods to 
reproduce known structures, we also compared 
the rmsd between the lowest energy docked struc- 
ture and the crystaUographic structure- 

DOCKEVG- AND SEARCH -METHOD-SPECIFIC 
PARAMETERS 

The proteins and hgands in the seven docking 
tests were treated using the united-atom approxi- 
mation, and prepared tising the molecular model- 
ing program, Syhvl^ Only polar hydrogens were 
added to the protein, and Kollman united-atom 
partial charges were assigned. Unless siated other- 
wise, all waters were removed^ Atomic solvation 
parameters and fragmental volumes were assigned 
to the protein atoms using a new AxtoDoCk: util- 
ity, AjddSol. The grid maps were calculated using 
AtJTOGRiD, version 3.0. In all seven protein-ligand 
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(a) 



(b) 




C3I, 




(c) 



;"CH3 




H-iC 



(e) 




(f) 



OH 




Cg) 




NH, 



FIGURE 2. The seven ligands chosen for docking, shewing the rptatabie bonds as curiy arrows: (a) benzamidine; (b) 
camphor; (cj phosphocholine; (d) biotin; (a) HIV-1 protease inhibitor XK-263; (fl isopropylated sialic acid; and (g) 
methotrexate. Note that two ligands, (e) and (f), contain hydroxyl rotors, which are not counted in the total number of 
torsional degrees of freedom; note also that cyclic rotatable bonds are excluded. 



cases, we used grid maps with 61 X 61 X 61 points, 
a gxid-point spacing of 0-375 A, and, because tlxe 
location of the ligand in the complex was kaown^ 
the inaps were centered on the ligand's binding 
site, Thie^ ligands were treated in Sybyl initially as 
all atom entities^ ihatis^ aU hydrogens were added, 
then partial atomic charges were calculated using 
the Gasteiger-Marsili method ."'^^ AtnoToKS, an 
AUTODOCK utility^ was used to define the rotatable 
bonds in the ligand, if any, aiui also to unite the 



nonpolar hydrogens added by Sybyl for the partial 
atomic charge calculation. "Ihe partial charges on 
the nonpolar hydrogens were added to that of the 
hydrogen-bearing carbon also in AixroTofis. 

In all three search methods, 10 docldngs were 
performed; in the analysis of the docked conf onha,- 
tions, the clustering toleraiice for the root-mean- 
square positional deviatioan was 0.5 A, and the 
crystallographic coordinates of the ligand were 
used as the reference structure. For all three search 
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TABLE 11. 

X-flay Cry^^l Strvctun^ Coordinate Usdd In Docking BcporimGnte, Their Brookhaven Protoln Date Bank 
AcMMlon CodM and Resolution, Number of Rotatable Bonda In the U^and, Number of Torsional 
Degrees of Freedom, Total Number of Degrees of Freedom, and Energy of Cryotal Structure Uslns the 
Empirical Force ReJd Presented Here. ^ 



Protein-ligand complex 



PDB Resolution 

code CA) Reference 



Number 
of 

rotatable 
bonds 



''lor 



Total 
number of 
degrees of 

freedom 



Energy 
of crystal 
structure 
(kca! mol ~ '') 



^-Trypsin / benzamidine 3ptb 

Cytochrome P-^^BO^^ I camphor 2cpp 

McPC-603 / Phosphocholine 2mcp 

Streptavidin / biotin istp 

HIV-1 protease /XK2e3 ihvr 

Influenza hemagglutinin / sialic acid 4hmg 

Dihydrofoiate reductase / methotrexate 4dfr 



1,7 

1,63 

3-1 

Z6 

1.8 

3.0 

1.7 



69 

70 
71 
73 
75 
76 
79 



0 
0 
4 
5 
10 
11 
7 



0 
0 
4 
5 
6 
7 
7 



7 
7 
11 
12 
17 
18 
14 



-7.86 
-4.71 
4-5.48^ 
-8.86 

-18.62 
-4.71 

-13.64 



^J^^^l^"^"^^ °* torsional degrees of finsadom usad in Ihe calcuJation of the predicted fr^e energy change of Wndlna AG 
Note ttiat thte Mdudes rotalaWe bonds that on»y move hydrogens, such as hydraxyl, amino, and methyl groups 

Thl8 energy \» dominated by a large, poeitiv<e cxjntributlon from C2 and 01 to the IntornaJ nonbonded energy of +6 13 kcal 
nxDi ; these atoms are 2.26 A apart. - 



methods, the step sizes were 02 A for translatians 
and 5* for orientations and torsians. These step 
sizes determined the amount by which a state 
variable could change when, a move is made in 
simulated annealing and the relative size of muta- 
tion in the local search, whereas the a and ^ 
parameters determined the size of the mutation in 
the genetic algorithms, GA and LGA- The Caudiy 
distribution parameters were a = 0 and ^ = 1. 
Note that in simulated annealing, random chaxiges 
were generated by a tmif annly distributed random 
number generator; in the Solis and Wets local 
search, by a normal distribution; and, in the ge- 
netic algorithm, by a Cauchy distribution- In the 
simulated armealiiig tests, the initial slate of the 
ligand was chosen randomly by AtnxDDcoc We 
used ibe optimal set of simulated annealing pa- 
rameters that were determined from the schedule 
experiments described earlier/ These included an 
initial annealing temperature of 616 cal mol""-^, a 
linear temperature reduction schedule/ 10 runs, 50 
cycles, and a cyde-termination criterion of a maxi- 
mum of 25,000 accepted steps or 25,000 rejected 
stepS;. whichfiver came first. The minimum energy 
state was used to begin the next cycle; the only 
exception was for llivr, where the iiutial aimealing 
temperature was increased to 61,600 cal mol~^. 
The maximum iiutial energy allowed was 0.0 kcal 
mol"'', and the maximum number of retries was 
1000, xised to generate a low energy random initial 
state to begin each simulated annealing docking. 



In the GA and LGA dockings, we used an initial 
population of random individuals with a popula-- 
tion size of 50 individuals; a maximum number of 
1-5 X 10^ energy evaluations; a maximum number 
of generations of 27,000; an elitism value of 1, 
which was the number of top individuals that 
automatically survived into the next generation; a 
mutation rate of 0.02, which was the probability 
that a gene woiild undergo a random change; and 
a crossover rate of 0,80, which was the probability 
that two individuals would undergo crossover. 
Proportional selection was used, where the aver- 
age of the worst energy was calculated over a 
window of the previous 10 generations. In the 
LGA dockings, the pseudch-SoKs and Wets local 
search method was used, having a maximum of 
300 iterations per local search; the probabiUiy of 
performing local search on an individual in the 
population was 0.06; the maximum number of 
consecutive successes or failures before doubling 
or halving the local search step size, p, was 4v in 
both cases; aiid the lower boxind on p, the termi- 
nation criterion for the local search, was 0-01. 



Results and Discussion 

CALIBRATION OF EMPIRICAL FREE 
ENERGY FCIVCTIOIV 

Several linear regression models were tested for 
their abiUty to reproduce the observed binding 
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TABLE ill. 

Calibration of Empirical Free Energy FundJonT 



Model* 


Residual 
standard 
error 


Muto'ple 












A 

B 
C 


2.324 
2.232 
2.177 


0.9488 
0.9537 
0,9559 


0,1795 
(0.0263) 

0.1518 
(0.0269) 

0.1485 
(0.0237) 


0.1133 
(0.0324) 

0,1186 
(0.0246) 

0.1146 
(0.0238) 


0.0166 
(0.0625) 

0.0126 
(0.0382) 

0.0656 
(0,0558) 


0.3100 
(0.0873) 

0.3548 
(0-0890) 

0.3113 
(0.0910) 


0.0101 
(0.0585) 

0-1539 
(0.1050) 

0,1711 
(0,1035) 



^n^t^^H^^^^^^ ^ hydrogen bonding term. Model A: full voluma-based solvation term 

t^r^^l Xl^"" ^."^'.?^' ^ ^' ^' aP°iar ligand atoms only In the solvation term, and standard 10^?^ 

a?]!^S"(S^ ^ ^ atoms only in the aoNatlon tern, and the standard 10-12 hydrogen less the aSimSed 

^ Values for the modei coofficlants, wlJh standard deviatione In paremhasea 



canstants of structurally characterized complexes. 
Table HI shows the results for the thi^ major 
candidates^ and Figure 3 shows the correlation 
between, the observed and the predicted binding 
free energies for the 30 proteui--ligand complexes 



in the calibration, set, using the chosen model 
(model C). Model A adds the full volume-based 
solvation method and the torsional restriction term 
to the original molecular mechanics force field. 
Model B simplifies the solvation method by evalu- 




S -I 

-20 -15 -10 -5 

Otwdved binding free tnet^ (kcsl/mol) 

FIGURE 3. Predicted versus obsenred binding free energies for the calibration set and the docking testa. The solid 
line shows a perfect fit, and the dotted lines show one standard deviation above and below this. Hollow diannonds show 
the 30 protein-ligand connplexes used In fitting the terms of the binding free energy function. Solid triangles show the 
results of the simulated annealing (SAJ dockings, solid diamonds show the genetic algorithm (OA) dockings, and the 
solid squares show the Lamarckian genetic algorithm (LGA) dockings. Note the outlying biotin-streptavidin complex 
dstp), where it is believed there are signfflcant contributions to the binding free energy due to protein rearrangements. 
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atiivg the volixtne buried far only the carbon, atoms 
in the ligand. Model C also uses only Hgand car- 
bon atoms in the desolvation calculation, and also 
adds a constant term to €t\e hydrogen bonding 
funcHon, modeling desolvation of polar atoms. 
Model C was chosen for incorporatian into Axrro- 
DOCK 3.0, based on its better overall statistics, and 
on. criteria discussed in what follows. The form of 
this free energy function is: 



AG « AG 



vdW 



/ A 



5 



\ 



11 

.6 

V J 



i 



42 



+ E 



hliond 



toT*^tor 



(2) 



where Ehbond ^ estimated average energy of 
hydrogen bonding of water with a polar atomy and 
the summation in the solvation term is perf ormed- 
over all pairs consisting of only carbon atoms in 
the ligand, i, and atoms o£ all types, in the 
protein. Note that the internal or intramolecular 
interaction energy of the ligand is not included in 
the calculation of binding free energy; during 
docking, however^ Internal energy is included in 
the total docked energy, because changes in ligand 
conformation can affect the outcome of the dock- 
ings so this must be taken into consideration. We 
looked at linear regression models that did include 
the internal energy, and found that addii\g this 
term did not improve the model. The asstimption 
made is that^the internal energy of the ligand in 
solution and in tt\e complex are the same. The 
energies used and reported by AUTdDcck should 
be distinguished: there are docked energies, which 
indude the intermolecular and intraroolecular in- 
teraction energies, and which are used during 
dockings; and predicted free energies, which include 
the intermolecular energy and the torsional free 
energy, and are only reported at the end of a 
docking- Because the intermolecular energy grid 
maps include tiie desolvation term, dockings using 
the new, empirical force field in AuioDOQC ver- 
sion 3-0 may be qualiiatively different from results 
found using earlier versions ► 

Three coefficients, for dispersion/repulsion, 
electrostatics, and loss of torsional freedom were 



very stable in the linear regression analysis, with 
consistent coefficient values in different formula- 
tions and reasonable standard deviatLons. In our 
best model, dLspersian/repulsion energies, with 
parameters taken from Amber,^ were weighted by 
a factor of 0.1485, yielding an energy of about 
— 0^ kcal mol"-^ for the mDst favorable atom-atom 
contacts. Electrostatics, modeled with a screened 
Coulomb potential,^ were weighted by a factor of 
0.1146, yielding an energy of about —1,0 kcal 
mol""'- for an ideal salt bridge. In the torsional 
restriction term, each torsional degree of freedom 
requires 0.3113 kcal mol~^v 

The major diEferences between models occurred 
with the interaction of the hydrogen bonding term 
and the desolvation term, Hydtogen bonding is 
modeled with a dicectional 12-10 potential.^ We 
encoimtered a major problem when calibrating this 
hydrogen bonding function. Because the test set 
included only natural enzyme-ligand complexes, 
optimized by nullions of years of evolution, hydro- 
gen banding groups in the ligands are nearly al- 
ways paired with the appropriate hydrogen bond- 
ing group in the protein. Thus, tive number of 
hydnogen bonds that the ligand forms in the com- 
plex and the number it forms witiv solvent when 
free in solution are approximately the same; that 
is, there is little change in the free energy of 
hydrogen bonding, and AG^bond evaluated to 
be approximately zero. Uiifortunately, this pro^ 
vides lU) information on the cost of burying a 
hydrogen bonding group without forming a bond 
with the protein, and our data set did not include 
cases to evaluate this- Of course, the volixme-based 
solvation method shoidd account for this — the un- 
favorable polar contribution to the solvation en- 
ergy should compensate for the favorable 12-10 
hydrogen bonding energy. The linear regression, 
however, consistentiy returned coefficients that set 
the hydrogen bonding energy and desolvation en- 
ergy to nearly zero, and increased the dispersion/ 
repulsion term to compensate (see Model A in 
Table TED- We chose an alternative formulation to 
resolve this problem- 

We obtained the best results by separating the 
desolvation of polar atoms from the volume-based 
calculation. We assumed that the extent of hydro- 
gen bonding in the complexes was roughly the 
same as the extent of hydrogen bonding in solu- 
tion. The calculated hydrogen bonding energy, us- 
ing the directional 12-10 hydrogen bonding func- 
tion, was divided by the maximal number of 
possible hydrogen bonds, counting two for each 
oxygen atom and one for each polar hydrogen. For 
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tihe 30 complexes used in calibration, it was found 
that 36% of the maximum possible hydrogen 
bending sites were actually utilized. Values of 
Ei^^aRd ^^gi^g from 36% to 100% of the mavimal 
weU depth of 5 kcal mol"^ had little effect on the 
success of the formulation (data not shown), and a 
value of 36% was chosen- Optimized weights 
yielded an ideal hydrogen bonding energy in the 
complex of —0328 kcal mol"^, and the estimated 
average energy of each hydrogen bond in solution 
of —0*118- Because hydrogen bonding was mod- 
eled by this difference, the typical hydrogen bond- 
ing free energy of a compile was approximately 
zero, but there was a penalty of about 0*2 kcal 
mol~^ for oxygen and nitrogen atoms that did not 
fonn hydrogen bonds, driving the simulation to- 
ward docked conformaiions with maximal hydro- 
gen bonding. We are currently exploriog an appro- 
priate data set for evaluating this formulation more 
rigorously. 

The desolvation for carbon atoms in the ligand 
was evaluated using two different classes of atom 
type, aliphaiic and aromatic, as in the original 
study.^ The desolvatlon term was weighted by a 
factor of 0.1711 in our final empirical free energy 
force field, so a typical aliphatic carbon atom yields 
an energy of about — 0-2 kcal mol~^ upon binding. 

We cross-validated the free energy model in 
two ways. First, we irivestigated the influence of 
each member of the training set on the final coeffi- 
cients of the model, by removing each one from 
the training set and calculating the coefficients 
from the remaiaing 29 complexes. We found that 
none of them had a strong effect on the final 
values of the coefficients. 

We also performed a second kind of cross-vali- 
dation of the free energy model, by performing 
Solis and Wets local search using AxrxoDoCK and 
the new free energy function, starting from the 
x-ray crystallographic conformations of each in- 
hibitor in 20 HIV-1 protease-inhibitor complexes, 
to compare the resulting optimized conformations' 
predicted free energy change of binding, AG^^^dsA^^ 
with the experimentally determined values. These 
protease inhibitors were quite different, having 
from 7 to 28 torsions, and widely different side 
^^hains — charged^ polar, and hydrophobic, and 
constituted a diverse test set As can be seen from 
the results in Table IX, the correlation was very 
good, with an overall rmsd between the experi- 
mental and calculated values of AG-^-^^i^^^ of 1-92 
kcal mol"^. 

The final form of ttie free energy function may 
seem overparameterized, with additional weight- 



ing parameters added to a previously optimized 
parametenzation. We retained the molecular me- 
chanics formulation, however, specifically for its 
ability to model the distance dependence of each 
etiergetic term. This distance dependence (and an- 
gular dependence in hydrogen banding) Is essen- 
tial for finding vaHd docked canformations, but 
the amount aiui resolution of the available pro- 
tein~ligazid data do not support a full re-pameteri- 
zaiion of the functior\s . 

DOCKIIVG EXPERIMENTS 

Because we are compaiing different search 
meihods, it is important to ensure that the meth- 
ods are treated equally. It Is therefore important 
that each search method be allowed approximately 
the same number of energy evaluations in a dock- 
ing. The number of energy evaluations in a dock- 
ing depends on the termination criteria, and be- 
cause it is not possible to predict how many 
accepted or rejected steps the stochastic S A method 
will make at a given temperature, the number of 
evaluations varies in SA- The range was frorn 
1.19 X 10^ to 2.33 X 10^, depending on the pnv 
tein-ligand test system, even though the same 
parameters were used for the mmrvber of cycles, 
accepted steps and rejected steps. In the case of the 
GA dockings, the population was 50 and the num- 
ber of gerierations was 27,000, which gave a total 
of 1.35 X 10^ energy evaluaticms in a docking; 
thus, the GA dockings were terminated by reach- 
ing the maximum number of generations. In the 
case of the LGA dockings, 6% of the popxilation 
imderwent Lamarckian local search, each search 
consisting of 300 iterations and each iteration us- 
ing an extra energy evaluation. Thus, the LGA 
dockings, even with the same population size and 
number of generations as the GA, were terminated 
by reaching the raaximum number of energy eval- 
uations, 1.50 X 10^ 

The results of the simulated annealing;, genetic 
algoritiim, and the Lamarckian genetic algorithm 
docktng experiments are summarized in Tables IV, 
V and VI, respectively. The lowest energy docked 
structure ibund by each method is compared with 
the crystal structure of the ligand in Figure 4- The 
predicted change in free energy upon binding, 
AGp^ed' ^ lowest energy found by LGA is 
shown in Table VH, along with the experimentally 
observed change in free energy upon binding, 
AGobs- In addition, the breakdown of the energy of 
the lowest energy docked conf oramtion is shown, 
in terms of the intermolecular interaction energy. 
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TABLE W. 

Re»ult« of Sfmulated Annealing Etoddng^* " " " 



^ Energy (kcai mol-"*) and rmsd (A) 

rmsd 

Number Number of 
PDB of In Lowest lowest Mean Mean 

code clusters rank 1 energy energy energy rmsd 



3ptb 


5 


6 


-8.03 


0-21 


-7.B4 


(0.08) 


0.50 


(0,17) 


2.01 X 10^ 


2cpp 


4 


6 


-7.29 


0.81 


-7.22 


(0.03) 


0.91 


(0,30) 


2,33 X 10® 


2mcp 


10 




-4.09 


0-88 


70.89 


(2.10 X 10=) 


5.40 


(4.80) 


1.85 X 10® 


1stp 


10 




-8.46 


1.27 


-7.71 


(0.66) 


1.24 


(0.35) 


2.00 X 10® 


1hvr 


10 




-11.77 


1.15 


1,12 X 10^ 


(3.36 X 10^) 


6.13 


(2.61) 


1.19 X 10^ 


4hmg 


10 




-2.59 


3.77 


6,99 X 10* 


(1,52 X 10^) 


6,20 


(2.94) 


1.55 X 10^ 


4dfr 


10 




-8.73 


4.83 


6.13 X 10* 


(1.96 X 10^) 


5.04 


(1 .74) 


1,30 X 10^ 



The parametare used were 10 runs, 50 cycies, and a cycle-termlnaiion cxfterion of 25.000 accepted steps or 25,000 rejected 
stops, whldiever came first The rnnod conformatfonal dustering toferanca vwas 0.5 A. calculated from the ligand's crystallographic 
coordinates. Standard deviations grvan rn parenthases. 



AGi^t^ ^ intramolecular cnjergy; AG-^^, and ihe 
torsional free energy, AG^^, These results are dis- 
cussed case-by-case in what follows; "crystallo- 
graphic rmsd"' refers to the roofc-mean-square posi- 
tional deviation of a given conformation from the 
crystallographic coordinates. 

P-Trypsia/Bonzamidinc (3ptb) 

The recognition, of benzamidine by ^trypsin, 
which binds tightly in the specificity pocket of 
trypsin^ is chiefly due to the polar amidioe moiety 

TABLE V. 

Result« of Genetic Algorttlim Dockings." 



Energy (kcal mol" '') and rmsd (A) 



PDB 
code 


Number 
of 

dusters 


Number 

in 
rank 1 


Lowest 
energy 


rmsd 

of 
lowest 
energy 


Mean 
energy 




Mean 
rmsd 




Number of 

energy 
evaluations 


3ptb 


2 


9 


-8.17 


0.32 


-7.72 


(1.35) 


1.50 


(3.39) 


1.35 X 10* 


2cpp 


4 ^ 


7 


-7.36 


0.93 


-6-65 


(2.11) 


2.18 


(3.42) 


1.35 X 10® 


2mcp 


10 


1 


-5.17 


0.85 


-3-61 


(0.95) 


5.26 


(2.98) 


1.35 X 10® 


lsip 


7 


4 


-10.09 


0-75 


-8.42 


(1,82) 


2.96 


(3.04) 


1.35 X 10® 


Ihvr 


7 


4 


-21,41 


0.82 


-11.09 


(9.79) 


2.79 


(1.97) 


1-35 X 10® 


4hmg 


9 


2 


-7.60 


1.11 


-5.72 


(1.77) 


2.32 


(1,43) 


1.35 X 10^ 


4dfr 


10 


1 


-16.10 


0.95 


-10.24 


(3.95) 


4.39 


(2.37) 


1.35 X 10^ 



■The parameters used were 10 runs a population eizs of 50, and a rur>-termlnalion criterion of a maximum of 27.000 generations or 
a maximum of 1.5 x 10^ energy evaluations, whichever came first ^4atB that, In this case all runs temiinated after the maadmum 
number of genefstions was reached, which equals the product of the population size and the number of generations. The rmsd 
confomiaijonaJ clustering ioteranc© was 0.5 K calculated from the llgarKl's crystallographic coordinates. Standard deviations ere 
gfven in parentheses. 



Number of 

energy 
evaluations 



and the hydrophobic benzyl ring.*^ The amidioe 
moiety was treated as being protonated. It was 
assmned that delocalization of the TTM^lectrons of 
the benzene ring e^rtended to the Tx^system of the 
amidine, and thus the ligand was treated as a rigid 
body. All three search methods succeeded in find- 
ing lowest energy conformations that were also the 
ones with ttie lowest crystallograpiiic rmsd. In this 
case, the method that found the docked structure 
with the lowest energy was GA (Table V)^ but that 
found by the LGA method CTable VD was practi- 
cally the same. The mean of the final docked 
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TABLE VI, 

Results of Umivr^an Genetic Algorithm E>ockJ[ngs.^ 



Energy (kcal mo|-^) and rmsd (A) 



PDB 
code 



NJumber 
of 

clusters 



Number 

rn 
rank 1 



Lowest 
energy 



rmsd 
Of 

lowest 
energy 



Mean 
energy 



3ptb 

2cpp 

2mcp 

1stp 

1hvr 

4hmQ 

4dfr 



1 
1 
6 
1 

2 
3 
2 



10 

10 

2 

10 

9 

7 

7 



-8-15 
"7.36 
-5.54 
-10.14 
-21.38 
-7.72 
-16.98 



0.45 
0.93 
1.05 
0.69 
0.76 
1.14 
1.03 



-8,15 
-7,36 
-4.15 
-10,06 
-19.11 
-7.54 
-16.90 



(0.00) 
(0-00) 
(0.15) 
(0.05) 
(6.92) 
(0.19) 
(0.07) 



Mean 
rmsd 




Number of 

energy 
evaluations 


0-46 


(0.01) 


1,50 X 10^ 


0.93 


(0.00) 


1.50 X 10^ 


1.10 


(0.07) 


1.50 X 10® 


0-66 


(0.06) 


1,50 X 10^ 


0.85 


(0.35) 


1.50 X 10® 


1.18 


(0.12) 


1.50 X 10* 


0.96 


(0,07) 


1.56 X 10^ 



The parameters used were 10 rune, a population size of 50, and a rurhtermlnatlon criterion of a maMmum of 27.000 generartlons 
or a masdmum of 1.5 x 10*= eneray evaJuations. wnJchaver came flret Because locaJ ssarcfi also uses energy evaiuations the totaJ 
number of energy evaluations far the LGA method wa« greater than that forihe GA method, using the same population 'size and 
maximum number of generations; In the U3A dockings, the mns temiJnatad because Hie maximum number of erwrgy evaluations 
was Gxceadad, The rmsd conformational clustering toieranc© was 0.5 A, calculated from the ilgand's crystallographlc coordinates. 
Standard deviations are given in parentheses. 



energy across the tea dockings was lowest for 
LGA, foHorwed hy SA, and finally GA. This is 
reflected in a conaparison of ihe mean rmsd of the 
docked confarmation from the crystallographlc 
stmctuxe for each of fhe methods: GA had the 
highest m^an rmsd, followed by SA, and on avec- 
ase> the LGA produced conformations with the 
Lowest crystallographlc rmsd. Thus, considering 
their average performance/ the best search method 
at finding the lowest energy and the lowest rmsd 
was the LGA* The predicted binding free energy, 
AG^^, of the lowest docked energy structure ob- 
tained usiixg the LGA meihod was -8,15 kcal 
mol"^ (Table VII), whereas the observed value, 
^Gotft —6.46 kcal mol"^: this is within the 

estimated, error of the model. 

Cytocbronie ^-450^.^ / Camphor (2cpp) 

Camphor binds to the monooxygenase cy- 
tochrome F-450^in such that the C — bond 
is hydroxylated stereospecifically. The active site is 
deeply sequestered within ttve enzyme, and the 
crystal structure of the complex does not possess 
an obvious substrate access chaimel.^ This buried 
active site presmts a more challenging docking 
problem than 3pfb. Once bound, however, the 
substrate is "tethered" by a hydrogen bond that is 
donated from the Tyr-96 hydroxyl to the carbonyl 
oxygen of camphor, while the subtle complemen- 
tarity of the pocket and the hydrophobic skeleton 
of camphor help to position the rest of the sub- 



strate. The lowest energy found was — 736 kcal 
mol~^ found by both the GA CTable V) and LGA 
methods (Table VD; SA's lowest energy was —7.29 
kcal mol"^ (Table IV), which is practically the 
same. AH methods found the crystallographic 
structure, SA succeeding in 9 of 10 docktogs, GA 
in 7 out of 10 dockings, and LGA in all of the 
dockings (with success, once agairv being mea- 
sured as having a crystallographic rmsd of less 
than 1 A). In all three search method casesy the 
lowest energy cluster was the most populated^ 
with 6, 9, and 10 members using SA, GA, and 
LGA, respectively. The predicted binding free en- 
ergy, AGp^^ji, of the lowest docked energy struc- 
ture, was — 7.36 kcal mol^^ using the LGA method 
(see Table VTD^ whereas the observed value, AG^^^g, 
was --8-27 kcal mol"-'- — once again, this was 
within ttie estinnated error of the modeL 

McPC-603/Phodp)ioclioltiic (2mcp) 

Antibody molecules bind their target antigens 
with exquisite specificity, having close comple- 
mentarity between antigen and antibody surfaces, 
hydrogen bonding, van der Waals, and electro- 
static interactions. Fhosphocholine binds to Fab 
McPC-SOS/'^ and is an example of recognition is 
predominantly electrostatic in character, primarily 
due to the influence of Arg H52-^ There is little 
conformational change in the side chains of Pab 
McPC-603 upon binding, as indicated from the 
imbovmd crystal structure. We allowed aU four 
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FIGURE 4. A comparison of the lowest energy structure found by each search method and the crystal structure. The 
latter is shown in black. The sfnr^ulated annealing results are rendered wrth a striped texture, the genetic algorrthm 
results are stiaded gray, and the Lamarckian genetic algorithm results are while. Oxygen atoms are shown as spheres; 
other heteroatoms are not shown. Note that simulaied annealing tailed in the last two test cases, 4hmg and 4dfr» but 
both the genetic algorithm and the Lamarckian genetic algorithm succeeded. 



bonds to rotate during docking. The energy of the 
crystal structure was positive, due in most part to 
a large, positive internal energy dominated, by C2 
and Ol being too dose (2.26 A); this could be 
improved if Icxal minimization had been per- 
formed on the crystal structure before docking. 
The lowest energy found by each of the three 
search methods were —4.09, —5.17, and —554 
kcal mol"-^, using SA, GA, and LGA, respectively. 
Unlike 3ptb and 2cpp, these differences in en^gy 



were more significant. Both SA and GA found 10 
different dusters^ whereas LGA found 6 dusters. 
The mean energy of the 10 dockings was +70.89, 
—3.61, and —4.15 kcal mol""^ for SA, GA, and 
I-GA, respectively. Thus, on average, the LGA per- 
fonrued best in finding the lowest energy docked 
structure- Furthfiimore, the mean rmsd from the 
crystallographic coordinates was 5.40, 5.26, and 
1.10 A for SA, GA, and LGA, respectively, indicat- 
ing that LGA also reproduced the crystal structure 
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TABLE VM. 

Comparison of Predicted Free Energy of Binding, AGp^^, of Lowemt Energy Docked Stnichire Obtained 
Using Lamarckten Genetic Algorithm, and Obae rvcd Free Energy of Binding, AGa^,," 

Energy (kcal mol" and rmsd (A) 



PDB 

code 


Lowest 
energy 


rmsd of 

lowest 

energy 














3ptb 


-8.15 


0.45 


-8.15 


0.00 


0.00 


-8,15 


-6.46 


-1.69 


2cpp 


-7.36 


0.83 


-7-36 


0.00 


0.00 


-7.36 


-8.27 


+0.91 


2mcp 




1-05 


-6.57 


+ 1,03 


+ 1^5 


-5.32 


-7.13 


+ 1.81 


Istp 


-10.14 


0-69 


-9,90 


-0.24 


+ 1.56 


-a.34 


-18.27 


+9.93^ 


1hvr 


-21,38 


0-76 


-19,34 


-2.04 


+2.48 


-16.85 


-12.96 


-3.89 


4hmg 


-7,72 


1.14 


-8.93 


+ 1.21 


+2.18 


-6.75 


-3.48 


-3.27 


4<ifr 


-16.97 


1.03 


-16.57 


-0.40 


+2,18 


-14,39 


-13.22 


-1.17 



*^Ginw ^ ^ IntermoJecular interoction energy between the ilgand and the naceptDr. AGmira the Intramolecular Interaction 
energy of the llgand. and AG^^,, Is the tDrsksnal free energy change of the llgand upon funding. 

^ This large di8Cfaparu:y may be due to neglect of the conformational roafrangemente of streptHvndin upon WfKilng blotin, which are 
neglectad In the docking simulation and binding free energy caiculatton. 



most often. The predicted binding free energy, 
A(^g^, of tiie lowest docked energy structure was 
—531 kcal xnDl~^ using ttue LGA method (see 
Table VTO, whereas the observed value, AG^bfl, "w-as 
"7.13 kcal mor^— tttis was also within the esti- 
mated error of the model. 



Streptavidfin / BloUa (Istp) 

One of the most tightly binding noncovalent 
complexes is that of streptavidin/biotiiv with an 
experimentally observed dissociation constant 
of 10 ""^^M. Comparison of the apo form and the 
complex^ shows that the high affinity results from 
several factors^ inclxiding; fonnatian of multiple 
hydrogen bonds and van der Waals interadionLS 
between the biotrn and the protein^ in addition to 
the ordering of surface polypeptide loops of strep- 
tavidin upon binding biotiit The method that 
fotmd the lowest energy was IGA, at --10.14 kcal 
mol"'^, although GA was not significantly differ- 
ent, followed by SA with -8.48 kcal mol-\ TtW 
method with thie lowest mean energy was LGA at 
-10,06 kcal mol'S then GA with -8.42 kcal 
inol"^, and finally SA with -7.76 kcal xnol^^ The 
method that found the crystalksgraphic complex 
coordinates most often was LGA, having a mean 
rmsd of 0.66 A, then SA at 1^4 A, and finally g!a 
with 2-96 A^ At the nnsd toletance chosen for these 
expeiiments, 0.5 A, SA found 10 different confor- 
mational clusters, GA found 7 dusters (the most 
populated was raxvk 1, with 4 members), and LGA 
foxmd 1 cluster. ' 



It was not possible to include the entcopic ef- 
fects of the flexible surface loops of streptavidia hx 
the docking of biotin, although they make signifi- 
cant contributions to the bixuling free energy as" 
revealed by a recent set of experiments involving 
an atomic force rriicroscope/^ It was found that the 
imbinding forces of discrete complexes of strepta- 
vidin with biotin analogs were proportional to the 
enthalpy change of the complex formation but 
independent of charvges in the free energy, which 
indicates ttiat the unbinding process is adiabatic 
and that entropic changes occur after uxibrcidiitg. 
This may help to explain why the predicted bind- 
ing free energy of the streptavidin/biotin complex 
(AGp,^) —10.14 kcal mol"^^ underestaroated the 
magnitude of the observed value (AG^^g) —18.27 
kcal mol-^ (Table VH). 



HIV-1 Protease /XK263 (Ihvr) 

HtV-l protease inhibitors prevent the matura- 
tion of virions of HIV, and are a major target for 
computer-assisted drug design in the development 
of AIDS therapies. Substrates and inhibitors of 
HIV-1 protease are typically extended peptides or 
peptidomimetics, with a dozen or more freely ro- 
tatable bonds and, as such, they present a chal- 
lenging target for automated docking techniques. 
In addition, considerable protein xnotion is ex- 
pected in the flaps upon binding, to allow the 
continuous polypeptide to reach the active site. 
However, most docking methods use a rigid pro- 
tein, target, and explicit modeling of the opening 
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and closing of the flaps is not performed: thus, the 
ligand must "fhread'' its way into the active site, 
Tlie cyclic urea HTV-protease inhibitor, XK-263, 
has 10 rotatable bonds, exduding the cyclic urea's 
flexibility. All three search methods found solu- 
tions near to the crystal structure'^: interestingly, 
the lowest docking energy found by SA was 

— 11.77 kcal mol"^ and had an rmsd of 1.15 A 
from the cryBtaL structure, whereas GA and LGA 
fotmd much lower energies but were stUl near to 
the active site, having crystallographic nnsd val- 
ues of 0^2 A and 0.76 A, respectively. The lowest 
docking energy found overall was —21.4:1 kcal 
rool"-"^, and was foimd using GA, although fhat 
found by LGA was practically the same. The pre- 
dicted binding free energy, ^G^^^^. of the lowest 

energy structure was —16-85 kcal mol"^ using 
LGA, whereas the observed value, AG^ta^ was 

— 12-96 kcal mcl~^. The larger discrepancy be- 
tween the predicted and observed values may be 
due to the entropic contributions of protein aide 
chain and flap conformational rearrangemjcnts, or 
may be due to other low-energy conformational 
states of the cyclic urea moiety of XK-263, which 
are neglected in our calculations. 

Influenza HemagglutiiiitL/ Sialic Acid (41img) 

The recognitiQn of sialic acid by influenza 
hemaggluttnin is chiefly mediated through hydro- 
gen bonding: sialic acid has five hydroxyis, three 
in the glycerol group, one carboxylate, a cydic 
ether oxygen^ and an acetamido group, with a total 
of 11 rotatable acyclic bonds. We used the crystal 
structure of Weis et aL/^ although the low resolu- 
tion meant that the overall coordinate error was 
approximately 035-OM A, which, in iiselE^ pre- 
sents a potential challenge in the docking tests. We 
miodeled an isoptopylated derivative of sialic add 
to mimic part of an adjacent six-tnembered ring 
that would normally be present in this complex, 
but was not seen due to disorder this introduced 
an extra rotatable bond/ giving a total of 11 tor^ 
sions. Furthermore, in these tests, we used the 
crystal conformation of the six-membered ring, 
although normally we would use several of the 
lowest energy conformatians of the ring system 
and dock -ftvese separately. 

This was one of two cases where simulated 
annealing failed to* find a docking that was near 
the crystal structure: the lowest energy structtire 
found had an rmsd of 3.77 A from the crystallo- 
graphic structure, and the docking with the lowest 
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crystallographic rmsd was 2.36 A. The mean en- 
ergy of all 10 SA dockings was very high (6.99 x 
10* kcal mol"^); only 4 of the 10 SA dockings 
found negative energies- Both GA and LGA, how- 
ever, succeeded in finding conformations near 
1^ A nrvsd) ttie crystal conformation. The low- 
est energy fburul was hy LGA, aiui was — 7,72 kcal 
mol"^. This structure had a crystallographic rmsd 
of L14 A, and had a predicted binding free energy, 
AGpj^, of —6^75 kcal mol"^; the observed binding 
free energy, AG^^, for the sialic acid-hemagg- 
lutinin complex was —3.48 kcal mol"'-. The differ- 
ence in predicted and obserral brndiag free ener- 
gies may be due to the structural differences 
between the isopropyl ated derivative that wa s 
docked and sialic acid itself. 

Diliydrofolate Reductase /Methotrexate 
(4dgr) 

Methotreacate is an antimetabolite that attacks 
proliferating tissue selectively induces remissions 
in certain acute leukemias^; however, dangerous 
side effects of methotrexate in normal cells con- 
tinue to make DHFR an important target in com- 
puter-assisted anticancer drug design.'^ We used 
the crystal structure of E. colt dihydxofolate reduc- 
tase complexed with methotrexate'^ to investigate 
a more challenging docking problem- We assumed 
that waters 603, 604> and 639^ which mediate hy- 
drogen bonding between the inhibitor and the 
protein, were conserved biowaters, and included 
them in the protein structure in our grid calcula- 
tions- Ideally, these should be predicted, and re- 
cently a method hased on a Jt-nearest-nei^bors 
classifier and a genetic algorithm called Consolv 
was reported to do just this.^ 

This is one of the two test cases where simu- 
lated annealing failed: the lowest energy structure 

that it found had an rmsd of 4,83 A from the 
crystal structure- This coxald be because the final 
docked conformation in simulated annealing is 
arrived at after a series of continuous steps, and if 
the route to the active site is blocked, the docking 
will tend to fail before the Ugand reaches -ftve 
active site. Note that, in the case of the camphor- 
cytochrome P^O^ docktr\g, the random initial- 
ization loop was able to find initial states that were 
inside the binding pocket, hut in this case the 
dockings failed to start near the active site. 

The lowest energy found was —16.98 kcal 
mol~^, and was found using LGA: this structure 
had an rmsd from the crystal structure of 1,03 A. 
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The predicted binding free energy, ^C^ed' 
lowest docked energy structure was —14.39 kcal 
mol'"^ U3ing LGA, whereas the observed value, 
AGjj^jft, was —13^ kcal mol"^ This finding was 
within the estimated error of the model. 

JUDGING SEARCH METHODS 

To evaluate the new search methodS/ and to 
compare them with the earlier search method of 
simulated annealing, we addressed the following 
questions: Which search method is most efficknt? 
That is, which finds the lowest energy in a given 
number of energy evaluations? Which search 
method is most relmblef That is, which method 
finds the most conformations similar to that of the 
lowest energy? Finally, which search method is 
most Quccessfull That is, which finds the crystallo* 
graphic conformation most often after a given 
number of dockings? Furthermore^ because these 
comparisons were carried out using the new, em- 
pirical free energy force field, these tests also rep- 
resent an evaluation of the force field itself, and/ if 
the global minimum of the force field is tmable to 
reproduce observed crystallographic structures, its 
usefulness will be limited- Because it is very diffi* 
cult to determine the global minimum of such a 
complex function, we cannot answer this questian 
definitiveLy; however, we can report the lowest 
energy found by any of the methods and its struc* 
tural similarity to that of the crystal structure. 

If we calculate statistics across all seven pro- 
tein— ligand test systems for each search method, 
we obtain a quantitative estimaie of relative per- 



formance of each search method (see Table VHD. 
If, in each of the seven test systems, we assume 
that the lowest docked energy found by any 
method is the effective global rrunimiim energy, 
and then calculate the difference between this en- 
ergy and all of the docked energies found by each 
search method^ we can then calculate the mean 
and standard deviation of this difference energy 
for each search method. Ideally, the mean and 
standard deviation of this value would be zero. 
The mean of this difference energy was lowest for 
LGA (0.40 kcal mol"^), followed by GA (3.41 kcal 
mol-^), and finally SA (2,62 X 10^ kcal mol"^): 
the very high mean difference energy for SA is 
indicative of the cases in which this method failed 
to escape a local minimTtm^ where the ligand 
was partially or wholly trapped within the pro- 
tein- Hence, in answer to the first question/ the 
Lamarckian genetic algorithm, LGA, is the most 
efficient search method- 

In terms of how often the structure with lowest 
energy was found, LGA performed best; the mean 
of the number of docked structures in rank 1 was 
78% for LGA, 40% for GA, and 24% for SA. The 
mean of the number of dusters found was lowest 
for LGA (2.23\ followed by GA (7.00), and finally 
S A (8.43)- Hence, the most reliable search method 
was LGA. 

In comparing the relative success of each search 
method in reproducing the crystallographic struc- 
ture, considering the crystallographic rmsd across 
all 10 dockings in each of the 7 test systems, the 
mean rmsd was lowest for LGA (0,88 A, standard 



TABLE VIII =r-T^— : ; 

Statistical Compariaon of Three Search Methods In Air roDoac 3.0 Acroso all Sovan Test Sygtema/ 

Energy (kcal mo[~^) and rmsd (A) 



Search 

method Siai^sfTic 



Number 
of 

clusters 



Number 

in 
rank 1 



Difference from 
effective global 
minimum energy 



rmsd 

of 
lowest 
energy 



Mean 
energy 



SA 
GA 
LGA 



Mean 
SD 

Mean 
SD 

Mean 
SD 



8.43 
2^70 
7.00 
3.06 
2-29 
1.80 



2.43 
2,44 
4.00 
3.06 
7.86 
2.91 



2.62 X 10^ 
1.40 X 10^ 

3.41 

5.31 

0.40 

2.62 



LBS 
L74 
0.62 
0,25 
0.66 
0.24 



2.61 X 10^ 
4,60 X 10* 
-7.S4 
2.59 
-10.47 
5.47 



Number of 
Mean energy 
rmsd evaluations 



3.63 1-75 X 10^ 

2-61 4.15 X 10^ 

3.06 1.35 X 10® 

1.32 0.00 

0.68 1.50 X 10^ 

0.25 0-00. 



" The saarch metho<l5 are simulated anneaJing (SA). genetic algorithm (C3A), and Lamarcidan geneUc aJgorithm (LQa). The mean 
and standard dflwiation (SD) for each criterion te sh<7wn. The effective global minimum energy for aach of the sev«i Test ^ 
the lowest docked energy found by any method for that test system. For each d ihe 10 dockings, !he difference between the final 
docked energy and ihis effecttv© global minimum energy was calculated; Ihe rnean and standaid deviation was calculated across 
ail 7 test systams, which was repeated for each search meihod. 



TABLE DC 

Resutts of Croas-Validatlon of Free Energy 
Function Using Local Search on 20 HlV-1 
Proteaso-lnhlbtor Complexes. 



PDB 
code 


EKpetimental 

^^blndlng 

(kcal/moD 


Calculated 
(kcal/moD 


1hvs 


-14.04 


— — — — — , 

-10.95 


1hvk 


-13,79 


-11.60 


1hvi 


"13.74 


-12-39 


7hvp 


-13.11 


-12-19 


1hps 


-12.57 


-11,80 


1hpv 


-12.57 


-8-24 


4phv 


— 1 iiJOl 


i jI net 


1hef 


-12.27 


-9.52 


Ihiv 


-12,27 


-13.02 


1hvl 


-12^7 


-10-35 


8hvp 


-12.27 


"9-36 


laaq 


-11-62 


-9.68 


Ihtg 


-11-58 


-13.13 


9hvp 


-11.38 


^10.54 


Ihih 


-10.97 


-11-43 


Iheg 


-10.56 


-8.60 


1sbg 


-10.56 


-10-35 


Ihtf 


-9.31 


-8-21 


1hbv 


-8.68 


-9.75 


1hte 


-7.69 


-7.28 



deviation 0^ A), followed lay GA (3.06 A, stan- 
dard deviation 132 A)^ and finally SA (3.63 A, 
standard deviation 2.61 A). Tlvese average results 
indicate tiiat, of ttie tiitee seardi methods, LGA 
will find the crystallographic structure most often, 
Thus^ the answer to the last question^ "Which 
method is most successful?/^ is LGA 

In two different cases, 4hmg and 4dfr; the simu- 
lated annealing method failed to reproduce the 
coitesponding crystal structure, although it suc- 
ceeded wi& Ihvr (see Fig. 4). This is important 
because methotrexate has 7 rotatable honds, and 
would he expected to he solvable using our rule- 
of-thumb that SA succeeds in problems with 8 
torsions or less; however, the HIV-l protease in- 
hibitor XK-263, has 10 rotatable bonds, and was 
successfully docked using SA. Thus, the degree of 
difficulty of a docking problem is not as simple as 
how many rotatable bonds there are; other factors, 
such as the nature of the energy landscape, deaily 
play an important role. 

It could be said that the crystallographic rmsd 
of the lowest energy structure found by any of the 



AUTOMATED DOCKING 

search methods is an estimate of the quality of the 
force field, although this is complicated by the fact 
that the search method itself must determine a 
docking near to the global jninimuriVy an unknown 
state. We can calcudate the energy of the ligand in 
the oystal structure using the new force field (see 
Table HX which we assume to be near the global 
minimum, but, unfortimately, the cxystal structure 
may contain frustrations and bad contacts. This 
appears to be the case in 2mcp, where a close 
contact between C2 and Ol causes a positive total 
energy to be calculated for the crystal structure- In 
aU cases, the lowest energy found, considering all 
the search methods, was lower than that of the 
corresponding crystal structure. 

The crystallographic rmsd of the lowest energy 
(found by any search method) for each of the 
prolein-ligand test systems were all within 1.14 A, 
or less, of the cxystal structure. This suggests that 
the force field's global minimum in each of the 
protein— ligand cases was near to the crystal struc- 
ture, if we accept the assumption that the crystal 
structure was near to or at the global minimum, 
and that the lowest energy found was near to the* 
global minimum. In some cases, dockings were 
found that had lower crystallographic rmsd values 
but' slightly higher energies than the lowest energy 
found. All of the lowest crystallographic rmsd 
values were 0.89 A or less, indicathig that low-en- 
ergy structures found by the force field were very 
similar to the corresponding crystal structure. 



Conclusion 

AinoDocK is a software package of general 
applicability for automated docking of small 
molecules, such as peptides, enzyme inhibitors, 
and drugs, to macromolecules, such as proteins, 
enzymes, antibodies, DNA, and RNA- New search 
mediods have been irrixoduced and tested here, 
xiaing a new, empirical binding free energy func- 
tion for calculating ligand-receptor binding affini- 
ties. 

We have shown that, of the three search mjeth- 
ods tested in AmoDocK (simulated annealing, 
genetic algorithm, and Lamarckian genetic algo- 
rltiurO, the most efficient, reliable, and successful 
is the Lamarckian genetic algorithm LGA. We de- 
fined efficiency of search in terms of lowest energy 
found in a given number of energy evaliiations; 
reliabiKty in terms of reprodudbility of finding the 
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lowest energy structure in independjmt dockings, 
as maasured by the number of canfoimatioM in 
llie top ranked duster; and success in terms of 
rqjroducing the known crysial structure. Simu- 
lated annealing failed to reproduce ibe crystal 
structures for the influenza hemaggjutin-fiialic acid 
complex (4hmg) and the dihydrofolate reduclase- 
metinotrexate complex (4dfr)- However, both the 
genetic algorithm and the tamarckian genetic al- 
gorithm methods succeeded- Thus^ the introduc- 
fcion of the LGA search method extrads the power 
and applicability of AUTODocK to docking prob- 
lems with more degrees of freedom than could be 
handled by earlier versions. 

The predicted binding affinities of the lowest 
energy docked coonfomiations^ using the LGA 
method and the new empirical free energy func- 
tion^ were within the standard residual error of tbe 
force field in foui of the seven cases (3ptb, 2cpp, 
2mcp^ and 4dfr), and reasonably close in two other 
cases (Ibvr and 4hmg). The large discr^ancy be- 
tween the predicted and the observed bindiivg 
affinity of biotln for streptavidin (IstpX even 
though the crystal structure was successfully re- 
produced^ may be due to the large free energy 
change that accompanies confonnational changes 
in the protein upon binding, in particular the sur- 
face loops. This remains a limitation of the method^ 
because protein motion is iwt modeled and suc- 
cessfully predicting such large-scale protein con- 
formational changes is difficult. The AxjroIX)CK 
method works well when there is little change 
between the apo and ligand-bound forms of the 
proteirt even if the protein undergoes significant 
conformational changes during binding* 

AuroDoCK predicts the binding affinity using 
one conformation of the ligand'-proiein complex, 
A new class of models for predicting receptor-lig- 
and binding affinities has been reported recently 
that considers not just the lowest energy state of 
the complex^ but the predominant states of the bind- 
ing molecules-^^ These approaches are grotmded in 
statistical thermodynamics^ and combine a modest 
set of degrees of freedom with aggressive confor- 
mational sampling to identify the low-energy con- 
formations of the complex and the free molecules. 
AuioDoCK version 3.0 currently performs exten- 
sive conformational sampling, information that 
could be incorporated into the calculation of the 
bindirig affinity. We are studying how the search 
methods can be modified such that statistical ther- 
modynamics calculations can be performed while 



the docking proceeds^ to improve the calculation 
of the binding affinity. 



AvallabUlty 

More information about AtnoDoCK and how to 
obtain it can be found on the World Wide Web 
at: http;//wTvw.scripps.edu/pub/olson-web / 
doc/autodock. 
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TifU: Op^if^i Acid hh'^nmen+ o^ lH PbH TV^ej 



1 50 

Hi_Imp ' ^-^MSLRIK 

E_Coli • ---MQSVTLCI MPRQYLLTTL VEILPMLRIA 

Strep --MSNWDTKFL 

B_Subtilis ^--^ ^ '-^ MWESKFS 

Mt_Impdh M SRGMSGLEDS SDLWSPYVR MGGLTTDPVP TGGDDPHKVA 

Borellia MPNKIT 

S. aureus ~ ■ MWESKFA 

Human2 -^^MADYLISG GTSYVPDDGL TAQQLFNC , , 

Mcu3e2 ---^MADYLISG GTSYVPDDGL TAQQLFNC. . 

Droimpdh MESTTKVKVN GFVESTSSSA APAIQTKSTT GFDAELQDGL SCKELFQN . , 

Yeast ' RAA IRDYKTALDL TKSLPRPDGL SVQELMDSKI 

Leishmania -^^^ MATNNAN YRIKTIKDGC TAEELrFR. , , 

Arab ^-^ ^ --^ ^--MSTLEDGF PADKLF . .AQ 



Hi_Imp 
E__Coli 
Strep 
Subt ills 
Mt_Impdh 
Borellia 
S . aureus 



51 

QEALTFDDVL 
KEALTFDDVL 
KKGYTFDDVL 
KEGLTFDDVL 
MLGLTFDDVL 
KEALTFDDVS 
KESLTFDDVL 



LVPAKSTVLP 
LVPAHSTVLP 
LIPAESHVLP 
LVPAKSEVLP 
LLPAASDWP 
LIPRKSSVLP 
LIPAQSDILP 



NTANLSTQLT 
NTADLSTQLT 
NEVDLKTKLA 
HDVDLSVELT 
ATADTSSQLT 
SEVSLKTQLT 
KDVDLSVQLS 



KEIRLNIPML 
KTIRLNIPML 
DNLTLNIPII 
KTLKLNIPVI 
KKIRLKVPLV 
KNISLNIPFL 
DKVKLNIPVI 



100 

SAAMDTVTET 
SAAMDTVTEA 
TAAItfDTVTGS 
SAGMDTVTES 
SSAMDTVTES 
SSAMDTVTES 
SAGMDTVTES 



Human 2 
Mouse 2 
Droi rap dh 
Yeast 
Leiskmania 
Arab 



GDGLTYNDFL 
GDGLTYNDFL 
GEGLTYNDFL 
RGGLAYNDFL 
GDGLTYKDFI 
GYSYTYDDVI 



ILPGYIDFTA 
ILPGYIDFTA 
ILPGYIDFTA 
ILPGLVDFAS 
ILPGFIDFGA 
FLPHFIDFST 



DQVDLTSALT 
DQVDLTSALT 
EEVDLSSPLT 
SEVSLQTKLT 
AJDVNISGQFT 
DAVSLSTRLS 



KKITLKTPLV 
KKITLKTPLV 
KSLTLRAPLV 
RNITLNIPLV 
KRIRLHIPIV 
RRVPLSIPCV 



SSPMDTVTEA 
SSPMDTVTEA 
SSPMDTVTES 
SSPMDTVTES 
SSPMDTITEN 
SSPMDTVSES 



101 



150 



Hi_Imp 
E_Coli 
Strep 
B_Subtilis 
Mt_Impdh 
Borel lia 
S . aureus 



KLAISLAQEG 
RLAIALAQEG 
KMAIAIARAG 
AMAIAMARQG 
RMAIAMARAG 
QMAIAIAKEG 
KMAIAMARQG 



GIGFIHKNMT 
GIGFIHKNMS 
GLGVIHKNMS 
GLGIIHKNMS 
GMGVLHRNLP 
GIGIIHKNMS 
GLGVIHKNMG 



lERQADRVRK 
lERQASEVRR 
ITEQAEEVRK 
lEQQAEQVDK 
VAEQAGQVEM 

lEAQ 

VEEQADEVQK 



VKKFESGIVS 
VKKHESGWT 
VKRSENGVII 
VKRSERGVIT 
VKRSEAGMVT 



EPVTVLPNLT 
DPQTVLPTTT 
DPFFLTPEHK 
NPFFLTPDHQ 
DPVTCRPDNT 



VKRSENGVIS NPFFLTPEES 



Human2 
Mouse2 
Droimpdh 
Yeast 
Leishmania 
Arab 



GMAIAMALTG 
GMAIAMALTG 
EMAIAT^ALCG 
EMATFr^ALLD 
EMAKTMALMG 
HMAAAMASLG 



GIGFIHHNCT 
GIGFIHHNCT 
GIGIIHHNCT 
GIGFIHHNCT 
GVGVLHNNCT 
GIGIVHYNCG 



PEFQANEVRK 
PEFQANEVRK 
PEYQALEVHK 
PEDQADMVRR 
VERQVSMVKS 
lAAQASIIRQ 



VKKYEQGFIT 
VKKYEQGFIT 
VKKYKHGFMR 
VKNYENGFIN 
VKAYRNGFIS 
AKSLKHPIAS 



DPWLSPKDR 
DPWLSPKDR 
DPSVMSPTNT 
NPIVISPTTT 
KPKSVPPNT? 
DAGVKFPEYE 



J 
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Hi_Imp 
E_Coli 
Strep 
B_Subtilis 
Mt_Impdh 
Borellia 
S . aureus 



151 

LAELAEMVKK NGFAGYPWD . .G 

LREVKELTER NGFAGYPWT , . E 

VSEAEELMQR YRISGVPIVE TLA 

VFDAEHLMGK YRISGVPIVN NEE 

LAQVDALCAR FRISGLPWD , .D 



ENNLIG 
ENELVG 
NRKLVG 
DQKLVG 
DGALVG 



IITGRDTRFV 
IITGRDVRFV 
IITNRDMRFI 
IITNRDLRFI 
IITNRDMRFE 



200 

KDLSKTVSQV 
TDLNQPVSVY 
SDYNAPISEH 
SDYSMKISDV 
VDQSKQVAEV 



VYEAEALMGK YRISGVPIVD NKEDRNLVG ILTNRDLRFI EDFSIKIVDV 



Human2 
Mouse2 
Droimpdh 
Yeast 
Leishmania 
Arab 



VRDVFEAKAR 
VRDVFEAKAR 
VGDVLEARRK 
VGEAKSMKEK 
ISNIIRIKEE 
ITS. LDAFGP 



HGFCGIPITD 
HGFCGIPITD 
NGFTGYPVTE 
YGFAGFPVTA 
KGISGILVTE 
SSFVFVEQTG 



TGRMGSRLVG 
TGRMGSRLVG 
NGKLGGKLLG 
DGKRNAKLVG 
NGDPHGKLLG 
T.MTTPKLLG 



IISSRDIDFL 

IISSRDIDFL 

MVTSRDIDF. 

AITSRDIQFV 

IVCTKDIDYV 

YVTKSQWKRM 



KEEEHDCFLE 
KEEEHDRFLE 
RENQPEVLLA 
. .EDNSLLVQ 
KNKDTP . . VS 
NYEQREMKIY 



Hi_Imp 
E_Coli 
Strep 
B Subtilis 
Mt_Impdh 
Borel lia 
S . aureus 

Human 2 
Mouse2 
DroimDdh 
Yeasc 
Leishmania 
A..X alo 



201 

MTKKEDL 
MTPKERL 
MTSEHL. 
MTKEEL . 
MTKAPL. 

. . . KixCj . 

MTQENL . 

E IMTKREDL 
EIMTKREDL 
DIMTTEL. 
DVMTKNP. , 
AVMTRREKM 
DYMKSCDSSD 



VTVKEGASRE 
VTVREGEARE 
VTAAVGTDLE 
VTASVGTTLD 
ITAQEGVSAS 
lEKVKTYKFQ 
ITAPVNTTLE 

VVAPAGITLK 
WAPAGVTLK 
VTAPNGINLP 
VTGAQGITLS 
TVERAPIQLE 
YCVPWEIDFE 



EILELMHQHR 
WLAKMHEKR 
TAERILHEHR 
EAEKILQKHK 
AALGLLRRNK 
KTINTNGDTN 
EAEKILQKHK 

EANEILQRSK 
EANEILQRSK 
TANAILEKSK 
EGNEILKKIK 
EAMDVLNRSR 
KLEFVLEDKQ 



VEKVLWNDS 
VEKALWDDE 
lEKLPLVDNS 
lEKLPLVDDQ 
lEKLPWDGR 
EQKPEIFTAK 
lEKLPLVKD 

KGKLPIVNED 
KGKLPIVNEN 
KGKLPIVNQA 
KGRLLWDEK 
YGYLPIVNEN 
KGFV.VLERD 



250 

FKLKGMITVK 
FHLIGMITVK 
GRLSGLITIK 
NKLKGLITIK 
GRLTGLITVK 
QHLEKSDAYK 
GRLEGLITIK 

DELVAIIART 
DELVAIIART 
GELVAMIART 
GNLVSMLSRT 
DEWNLCSRR 
GETVNWTKD 



Hi_Imp 
E__Coli 
Strep 
B_Subtilis 
Mt_Impdh 
Borellia 
S . aureus 

Human2 
Mouse2 
Droimpdh 
Yeast 
Lei shmania 
Arab 



DFQKAEQKPN 
DFQKAEAKPN 
DIEKVIEFPH 
DIEKVIEFPN 
DFVKTEQHPL 
. . EHKEDFPN 
DIEKVIEFPN 

DLKKNRDYPL 
DLKKNRDYPL 
DLKKARSYPN 
DLMKNQKYPL 
DAVRARDYPH 
DIQRVKGYPK 



ACKDEFG . . . 
ACKDEQG . . . 
AAKDEFG. . . 
SSKDIHG. . . 
ATKDSDG . . . 
ACKD. .LN N 
AAKDEHG. . . 

ASKDA. .K.K 
ASKDA. .K.K 
ASKDS . .N. K 
ASKS A. .NTK 
STLD . . . KSG 
SGPGTVGPDG 



RLRVGAAVGA 
RLRVGAAVGA 
RLLVAAAVGV 
RLIVGAAVGV 
RLLVGAAVGV 
KLRVGAAVS I 
RLLVAAAIGI 

QLLCGAAIGT 
QLLCGAAIGT 
QLLVGAAIGT 
QLLWGASIGT 
RLICAAATST 
EWMVGAAIGT 



GAGNEERIDA 
GAGNEERVDA 
TSDTFERAEA 
TGDTMTRVKK 
GGDAWVRAMM 
DIDTIERVEE 
SKDTDIRAQK 

HEDDKYRLDL 
HEDDKYRLDL 
RSEDKARLAL 
MDADKERLRL 
RPEDKRRVAA 
RESDKERLEH 



300 

LVKAGVDVLL 
LVAAGVDVLL 
LFEAGADAIV 
LVEANVDVIV 
LVDAGVDVLV 
LVKAHVDILV 
LVEAGVDVLV 

LAQAGVDVW 
LALAGVDVW 
LVANGVDVI I 
LVKAGLDWI 
LADVGVDVLV 
LVNVGVNAW 
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301 

IDSSHGHSEG 
IDSSHGHSEG 
IDTAHGHSAG 
IDTAHGHSQG 
VDTAHAHNRL 
IDSAHGHSTR 
IDTAHGHSKG 

LDSSQGNSIF 
LDSSQGNSIF 
LDSSQGNSVY 
LDSSQGNSIF 
LDSSQGNTIY 
LDSSQGNSIY 



VLQRVRETRA 
VLQRIRETRA 
VLRKIAEIRA 
VLNTVTKIRE 
VLDMVGKLKS 
IIELIKKIKT 
VIDQVKHIKK 

QINMIKYIKD 
QINMIKYIKE 
QVEMIKYIKE 
QLNMIKWIKE 
QIAFIKWVKS 
QLEMIKYVKK 



KY.PN.LPIV 
KY.PD.LQII 
HF.PN.RTLI 
TY . PE . LMI I 
EV . GDRVEW 
KY P NLDLI 
TY PE ITLV 

KYP. .NLQVI 
KYP . . SLQVI 
TYP . , ELQVI 
TFP . .DLEII 
TYP . . HLEW 
TYP . . ELDVI 



AGNVATAEGA 
GGNVATAAGA 
AGNIATAEGA 
AGNVATAEAT 
GGNVATRSAA 
AGNIVTKEAA 
AGNVATAEAT 

GGNWTAAQA 
GGNWTAAQA 
GGNWTRAQA 
AGNWTKEQA 
AGNWTQDQA 
GGNWTMYQA 



350 

lALADAGASA 
RALAEAGCSA 
RALYDAGVDV 
RALIEAGADV 
AALVDAGADA 
LDLISVGADC 
KDLFEAGADI 

KNLIDAGVDA 
KNLIDAGVDA 
KNLIDAGVDG 
ANLIAAGADG 
KNLIDAGADG 
QNLIQAGVDG 



Hi_Imp 
E__Coli 
Strep 
B__Subt:ilis 
Mt__Impdh 
Borellia 
S . aureus 

Human2 
Mouse2 
Droimodh 
Yeast 
Leishmania 
Arab 



Hi___Imp 
E_Coli 
Strep 
B Subtil is 
"■"Mt_lmpdh 
Borellia 
S . aureus 

Human 2 
Mouse2 
Droimpdh 
Yeast 
Leishmania 
Arab 



Hi_Imp 
E__Coli 
Strep 
B_Subtilis 
Mt__Impdh 
Borellia 
S . aureus 

Human2 
Mouse2 
Droimpdh 
Yeast 
Leishmania 
Arab 



351 

VKVGIGPGSI 
VKVGIGPGSI 
VKVGIGPGSI 
VKVGIGPGSI 
VKVGVGPGSI 
LKVGIGPGSI 
VKVGIGPGSI 



401 

GDIA.KAIAA 
GDIA.KAIAA 
GDIV.KALAA 
GDIT.KALAA 
GDIA.KALAA 
GDW KAIAA 
GDI I KALAA 



CTTR I VTGVG 
CTTRIVTGVG 
CTTRWAGVG 
CTTRWAGVG 
CTTRWAGVG 
CTTRIVAGVG 
CTTRWAGVG 



GASCVMVGSM 
GASAVMVGSM 

GGNAVMLGSM 
GGHAVMLGSL 
GASTAMLGSL 
GADSVMIGNL 
GGHAVMLGSL 



VPQITAIADA 
VPQITAVADA 
VPQVTAIYDA 
VPQITAIYDC 
APQITAILEA 
VPQITAICDV 
VPQITAIYDC 



FAGTEEAPGE 
LAGTEESPGE 
FAGTDEAPGE 
LAGTSESPGE 
LAGTAEAPGE 
FAGTKESPSE 
LAGTEESPGA 



AAALKDRGIP 
VEALEGTGIP 
AAVARSYGKT 
ATEARKHGKT 
VAACRPAGVP 
YEACNNTNIC 
ATEARKHGKA 



lELYQGRAFK 
lELYQGRSYK 
TEIYQGRKFK 
TEIYQGRRFK 
LIFVNGKQYK 
EIIYNGKKFK 
TEIFQGRQYK 



400 

VIADGGIRFS 
VIADGGIRFS 
IIADGGIKYS 
IIADGGIKFS 
VIADGGLQYS 
IIADGGIRFS 
IIADGGIKFS 



450 

SYRGMGSLGA 
SYRGMGSLGA 
TYRGMGSIAA 
VYRGMGSVAA 
SYRGMGSLGA 
SYVGMGSISA 
VYRGMGSLGA 



GHIA . KALAL 
GHIA.KALAL 
GHIV.KAIAL 
GHIITKALAL 
GDI .CKALAI 
GHIV . KALVL 



GASTVMMGSL 
GASTVMMGSL 
GASAVMMGSL 
GSSTVMMGGM 
GANCAMLGGM 
GASTVMMGSF 



LAATTEAPGE 
LAATTEAPGE 
LAGTSEAPGE 
LAGTTESPGE 
LSGTTETPGE 
LAGSTEAPGG 



YFFSDGIRLK 
YFFSDGIRLK 
YFFSDGVRLK 
YLYQDGKRLK 
YFFKGGVRLK 
YSYTNGKRIK 



KYRGMGSLDA 
KYRGMGSLDA 
KYRGMGSLEA 
AYRGMGSIDA 
VYRGMGSLEA 
KYRGMGSLEA 



LRVGMGSGSI CITQEVLACG RPQATAVYKV SEYARRFGVP VIADGGIQNV 
LRVGMGSGSI CITQEVLACG RPQATAVYKV SEYARRFGVP VIADGGIQNV 
LRVGMGSGSI CITQEVMACG CPQATAVYQV STYARQFGVP VIADGGIQSI 
LRIGMGTGSI CITQKVMACG RPQGTAVYNV CEFANQFGVP CMADGGVQNI 
IRIGMGSGSI CITQEVLACG RPQGTAVYKV AQYCASRGVP CTADGGLRQV 
LRVGMGSGSI CTTQEVCAVG RGQATAVYKV CSIAAQSGIP VIADGGISNS 
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Hi__Imp 
E_Coli 
Strep 
B__Subtilis 
Mt__Impdh 
Borellia 
S . aureus 

Human2 
Mouse2 
Droimpdh 
Yeast 
Leishmania 
Arab 



Hi___Imp 
E___Coli 
Strep 
B Subtil is 
Mt __Impdh 
Borellia 
S . aureus 

Human 2 
Mouse2 
Droimpdh 
Yeast 
Leishmauici 
Arab 



Hi_Imp 
E_Coli 
Strep 
B_Subt ills 
Mt_Impdh 
Borellia 
S , aureus 

Human2 
Mouse2 
Droimpdh 
Yeast 
Leishmania 
Arab 



451 

MA KG 

MS KG 

MK KG 

ME KG 

MRGRGGATSY 

MK RG 

ME KG 



501 

GLRSCMGLTG 
GLRSCMGLTG 
GIRSGMGYVG 
GLRSGMGYCG 
GLRAAMGYTG 
GLMSGMGYLG 
GVRAGMGYTG 



SSDRYFQ . SD 
SSDRYFQ . SD 
SSDRYFQGSV 
SKDRYFQ. . . 
SKDRYFADDA 
SKSRYFQLEN 
SNDRYFQ ED 



CATIDELRT 

CGTIDELRT 

AGDIQELHE 

SKDLRALRE 

SPTIEVLQ. 

AATISDL. . 

SHDLRELRE 



NAADKLVPEG 
NAADKLVPEG 
NEANKLVPEG 
EENKKFVPEG 
LSEDKLVPEG 
NEPKKLVPEG 
KAPKKFVPEG 

, .VAQG 
. . VAQG 
. .VAQG 
. . VAQG 
. .VAQG 
. . lAQG 



. . . . KAEFVR 
. . , . KAEFVR 
. . . .NAQFVE 
. . . .EAQFIR 
. . . . QAQFVR 
. , KINSKFVK 
. . . . EAQFTR 



lEGRIPYKGY 
I EGRVAYKGR 
lEGRVAYKGA 
lEGRTPYKGP 
lEGRVPFRGP 
lEGMVPYSGK 
lEGRTAYKGA 

VSGAVQDKGS 
VSGAVQDKGS 
VSGSIVDKGS 
VSGAWDKGS 
VSGMWDKGS 
WGAVADKGS 



ISGAGIKESH 
ISGAGIQESH 
MSGAGLIESH 
MTGAGLRESH 
ITPAGLKESH 
ISHSSLKESH 
MGPAGLAESH 



500 

LKEIIHQQMG 
LKEIIHQQMG 
ASDIVFQMLG 
VEETVYQLVG 
LSSVIHQLTG 
LKDILTQLKG 
LQDTIYQLMG 

IHKFVPYLIA 
IHKFVPYLIA 
VLRYLPYLEC 
IKKFIPYLYN 
AAKLIAYVSK 
VLKLIPYTMH 



550 

VHDVAITKEA 

VHDVTITKES 

PHDVQITNEA 

PHDVQITVHR 

PHDVAMTVEA 

PHDVFSIT 

PHNIQITKES 



GIQHSCQDIG AKSLTQVRAM MYSGELKFEK RTSSAQVEGG VHSLHSYEKR 
GIQHSCQDIG AKSLTQVRAM TYSGELKFEK RTSSAQVEGG VHSLHSYEKR 
GLQHSCQDIG ANSINKLRDM lYNGQLRFMK RTHSAQLEGN VHGLFSYEKR 

GLQHSCQDIG CRSLTLLKEN VQSGKVRFEF -----^ 

GLQQSAQDIG EISFDAIREK MYAGQVLFSR RSPTAQGEGG VHSLHSYEKK 
AVKQGFQDLG ASSLQSAHGL LRSNILRLEA RTGAAQVEGG VHGLVSYEKK 



551 581 

PNYRMG -"^-^ ^ 

PNYRLGS -^-^ 

PNYSVH ^ - 

NKALPGLFGS HQKKTGFVYD ECCQSGFFSS D 

PNYYA ^^^-.^^ ^-^^ ^ 

PNYSF 

LF^^^^^ . ^^^^ ^ 

LF^-^ --^-^ « 

LF---^-- -^^ ^ 



LFAAKT^ 
3F 



MD.,..KHLS SQNRYFSEAD KIK. 
MD....KHLS SQNRYFSEAD KIK. 
MERGDAKGAA MSRYYHNEMD KMK. 
MQKTGTKGNA STSRYFSESD SVL . 
M. . . ,SQGKE SGKRYLSENE AVQ . 
MTKG SDQRYLGDQT KLK. 
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